Recipe Rating Analysis: Nutritional Values and Ingredients

1 Data Description and Preparation

1.1 Loading data

Here we load the dataset and perform some cleaning and processing. We standardise the variable names and add an ID column so that each recipe has a unique identifier.

#loading the data
recipes_raw <- read.csv(here("data/epi_r.csv"))

recipes <- recipes_raw %>% 
  clean_names() %>% 
  mutate(ID = 1:nrow(.)) %>%
  select(ID, everything())

1.2 Data Description

tibble(Variables = c("**ID**", "**title**", "**rating**", "**calories**", "**protein**", "**fat**", "**sodium**", "**674 other binary variables**"), Meaning = c("Unique ID", "Recipe name", "Rating of the recipe", "Calories contained in the recipe", "Protein contained in the recipe (grams)", "Fat contained in the recipe (grams)", "Sodium contained in the recipe (milligrams)", "The rest of the data is made of many binary variables, incl. ingredients, types of recipes, US States, diet preferences, etc."), Variable_Type = c("Numerical", "Character", "Categorical", "Numerical", "Numerical", "Numerical", "Numerical", "Binary")) %>%
  kbl() %>%
  kable_styling(position = "center")
Variables Meaning Variable_Type
ID Unique ID Numerical
title Recipe name Character
rating Rating of the recipe Categorical
calories Calories contained in the recipe Numerical
protein Protein contained in the recipe (grams) Numerical
fat Fat contained in the recipe (grams) Numerical
sodium Sodium contained in the recipe (milligrams) Numerical
674 other binary variables The rest of the data is made of many binary variables, incl. ingredients, types of recipes, US States, diet preferences, etc. Binary

1.3 Example of the data

selected_columns <- names(recipes)[recipes[1, ] == 1]
additional_variables <- c("title", "rating", "calories", "protein", "fat", "sodium")
selected_columns <- unique(c(additional_variables, selected_columns))

recipes[, selected_columns] %>% 
  filter(ID==1)
#>                             title rating calories protein fat sodium
#> 1 Lentil, Apple, and Turkey Wrap     2.5      426      30   7    559
#>   ID apple bean cookie fruit kid_friendly lentil lettuce sandwich
#> 1  1     1    1      1     1            1      1       1        1
#>   tomato vegetable turkey
#> 1      1         1      1

Let us focus on the first recipe as an example and take a look at the different variables of interest. We can observe the title as well as the numerical variables rating, calories, protein, fat and sodium. Among the many binary variables, we then see those that apply to this recipe, such as apple, bean, cookie, fruit, kid_friendly and so on.

1.4 Classifying variables into categories

Given the large number of variables (680), we decided to create categories in order to aggregate them and make them easier to work with.

Specifically, here are the issues we had to solve when creating the categories:

  • Merge “father_s_day” and “fathers_day”; do the same for Mother’s Day, New Year’s, St. Patrick’s Day and Valentine’s Day, and for “vermout”/“vermouth”
  • Check what “leafy_green” is and how many observations it has
  • Do the same for “legume”
  • Decide whether to group beans together or leave them in vegetables
  • Decide what to do with “meat”, “meatball” and “meatloaf”, and likewise with “rabbit”, “sausage”, “steak” and “venison”
  • Put nutmeg in spices or in nuts?
  • Decide whether to create a seeds category, e.g. for “poppy” (put in spices for now), “seed” and “sesame”
  • Probably create a sauce category and a “full_meal” category
  • Decide where to put “tapioca” and “yuca”
  • Put “buttermilk” in drinks or in dairy?
  • Check whether a “dorie_greenspa” column exists or whether it was just a typo missing the final N
  • Check how many observations have “phyllo_puff_pastry_dough”
  • Decide whether to separate fish and seafood
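As an illustration of the first point, duplicated tag columns can be merged by taking the element-wise maximum of the two spellings. Below is a minimal toy sketch (the column names come from the list above; the data are invented):

```r
library(dplyr)

# Toy data mimicking the duplicated tag columns; the column names are
# taken from the list above, the values are invented.
toy <- tibble(
  father_s_day = c(1, 0, 0),
  fathers_day  = c(0, 1, 0)
)

# A recipe belongs to the merged tag if either spelling is 1:
# pmax() takes the element-wise maximum, then we drop the redundant column.
toy_merged <- toy %>%
  mutate(father_s_day = pmax(father_s_day, fathers_day)) %>%
  select(-fathers_day)

toy_merged$father_s_day
#> [1] 1 1 0
```

The same pattern applies to the other duplicated pairs (mother's day, new year's, vermouth, etc.).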

We have now resolved all these issues and manually classified every variable of the dataset into the specific categories shown below.

us_states <- c("alabama", "alaska", "arizona", "california", "colorado", "connecticut", "florida", "georgia", "hawaii", "idaho", "illinois", "indiana", "iowa", "kansas", "kentucky", "louisiana", "maine", "maryland", "massachusetts", "michigan", "minnesota", "mississippi", "missouri", "nebraska", "new_hampshire", "new_jersey", "new_mexico", "new_york", "north_carolina", "ohio", "oklahoma", "oregon", "pennsylvania", "rhode_island", "south_carolina", "tennessee", "texas", "utah", "vermont", "virginia", "washington", "west_virginia", "wisconsin")

us_cities <- c("aspen", "atlanta", "beverly_hills", "boston","brooklyn", "buffalo", "cambridge", "chicago", "columbus", "costa_mesa", "dallas", "denver", "healdsburg", "hollywood", "houston", "kansas_city", "lancaster", "las_vegas", "london", "los_angeles", "louisville", "miami", "minneapolis", "new_orleans", "pacific_palisades", "paris", "pasadena", "pittsburgh", "portland", "providence", "san_francisco", "santa_monica", "seattle", "st_louis", "washington_d_c", "westwood", "yonkers")

countries <- c("australia", "bulgaria", "canada", "chile", "cuba", "dominican_republic", "egypt", "england", "france", "germany", "guam", "haiti", "ireland", "israel", "italy", "jamaica", "japan", "mexico", "mezcal", "peru", "philippines", "spain", "switzerland")

alcohol <- c("alcoholic", "amaretto", "beer", "bitters", "bourbon", "brandy", "calvados", "campari", "chambord", "champagne", "chartreuse", "cocktail", "cognac_armagnac", "creme_de_cacao", "digestif", "eau_de_vie", "fortified_wine", "frangelico", "gin", "grand_marnier", "grappa", "kahlua", "kirsch", "liqueur", "long_beach", "margarita", "marsala", "martini", "midori", "pernod", "port", "punch", "red_wine", "rum", "sake", "sangria", "scotch", "sherry", "sparkling_wine", "spirit", "spritzer", "tequila", "triple_sec", "vermouth", "vodka", "whiskey", "white_wine", "wine")

others <- c("bon_appetit", "bon_app_tit", "condiment_spread", "cr_me_de_cacao", "epi_ushg", "flaming_hot_summer", "frankenrecipe", "harpercollins", "house_garden", "no_meat_no_problem", "parade", "sandwich_theory", "self", "shower", "tested_improved", "windsor", "weelicious", "snack_week", "tailgating", "quick_and_healthy", "picnic", "kitchen_olympics", "house_cocktail", "hors_d_oeuvre", "frozen_dessert", "freezer_food", "edible_gift", "cookbook_critic", "cook_like_a_diner", "condiment", "cocktail_party", "camping", "buffet", "x30_days_of_groceries", "x_cakeweek", "x_wasteless", "x22_minute_meals", "x3_ingredient_recipes")

chef <- c("anthony_bourdain", "dorie_greenspan", "emeril_lagasse", "nancy_silverton", "suzanne_goin")

interesting <- c("advance_prep_required", "entertaining", "epi_loves_the_microwave", "friendsgiving", "game", "gourmet", "healthy", "high_fiber", "hot_drink", "kid_friendly", "kidney_friendly", "microwave", "no_cook", "one_pot_meal", "oscars", "paleo", "pescatarian", "poker_game_night", "potluck", "quick_easy", "cookbooks", "leftovers")

seasons_vec <- c("winter", "spring", "summer", "fall")

celebrations <- c("anniversary", "back_to_school", "bastille_day", "birthday", "christmas", "christmas_eve", "cinco_de_mayo", "date", "diwali", "easter", "engagement_party", "family_reunion", "father_s_day", "fourth_of_july", "graduation", "halloween", "hanukkah", "kentucky_derby", "kwanzaa", "labor_day", "lunar_new_year", "mardi_gras", "mother_s_day", "new_year_s_day", "new_year_s_eve", "oktoberfest", "party", "passover", "persian_new_year", "purim", "ramadan", "rosh_hashanah_yom_kippur", "shavuot", "st_patrick_s_day", "sukkot", "super_bowl", "thanksgiving", "valentine_s_day", "wedding")

drink_no_alcohol_vec <- c("apple_juice", "fruit_juice", "iced_tea", "lemon_juice", "lime_juice", "orange_juice", "pomegranate_juice", "tea")

tools <- c("coffee_grinder", "double_boiler", "food_processor", "ice_cream_machine", "juicer", "mandoline", "mixer", "mortar_and_pestle", "pasta_maker", "ramekin", "skewer", "slow_cooker", "smoker", "wok", "blender", "candy_thermometer", "pressure_cooker")

cooking_techniques <- c("raw", "saute", "freeze_chill", "fry", "stir_fry", "simmer", "boil", "broil", "bake", "braise", "chill", "deep_fry", "steam", "rub", "roast", "poach", "pan_fry", "marinate", "grill_barbecue", "grill")

nutritional_values <- c("calories", "protein", "fat", "sodium")

recipe_type_vec <- c("aperitif", "appetizer", "breakfast", "brunch", "dessert", "dinner", "lunch", "side", "snack")

diet_preferences_vec <- c("dairy_free", "fat_free", "kosher","kosher_for_passover", "low_cal", "low_cholesterol", "low_carb", "low_fat", "low_sodium", "low_sugar", "low_no_sugar", "non_alcoholic", "no_sugar_added", "organic", "peanut_free", "soy_free", "sugar_conscious", "tree_nut_free", "vegan", "vegetarian", "wheat_gluten_free")

### Ingredients
#low level categories

vegetables_vec <- c("artichoke", "arugula", "asparagus", "butternut_squash", "bean", "beet", "bell_pepper", "bok_choy", "broccoli", "broccoli_rabe", "brussel_sprout", "cabbage", "capers", "carrot", "cauliflower", "celery", "chard", "chile_pepper", "collard_greens", "corn", "cucumber", "eggplant", "endive", "escarole", "fennel", "garlic", "ginger", "green_bean", "green_onion_scallion", "horseradish", "jerusalem_artichoke", "jicama", "kale", "leafy_green", "leek", "legume", "lentil", "lettuce", "lima_bean", "mushroom", "mustard_greens", "okra", "onion", "parsnip", "pea", "pickles", "poblano", "pumpkin", "radicchio", "radish", "root_vegetable", "rutabaga", "salad", "shallot", "soy", "spinach", "squash", "sugar_snap_pea", "tapioca", "tomatillo", "tomato", "turnip", "watercress", "yellow_squash", "yuca", "zucchini")

pork_meat_vec <- c("bacon", "ham", "pork", "pork_chop", "pork_rib", "pork_tenderloin", "prosciutto")

lamb_meat_vec <- c("ground_lamb", "lamb", "lamb_chop", "lamb_shank", "rack_of_lamb")

beef_meat_vec <- c("beef", "beef_rib", "beef_shank", "beef_tenderloin", "brisket", "ground_beef", "hamburger", "veal")

meat_with_wings_vec <- c("chicken", "duck", "goose", "poultry", "poultry_sausage", "quail", "turkey")

meat_various_vec <- c("meatball", "meatloaf", "rabbit", "sausage", "steak", "venison")

# stuff_in_the_water <- c("anchovy", "bass", "caviar", "clam", "cod", "crab", "fish", "halibut", "lobster", "mussel", "octopus", "oyster", "salmon", "sardine", "scallop", "seafood", "shellfish", "shrimp", "snapper", "squid", "swordfish", "tilapia", "trout", "tuna")
  
seafood_vec <- c("clam", "crab", "lobster", "mussel", "octopus", "oyster", "scallop", "shellfish", "shrimp", "squid")

fish_vec <- c("anchovy", "bass", "caviar", "cod", "halibut", "salmon", "sardine", "snapper", "swordfish", "tilapia", "trout", "tuna")
  

herbs_vec <- c("anise", "basil", "chive", "cilantro", "coriander", "dill", "lemongrass", "mint", "oregano", "parsley", "rosemary", "sage", "tarragon", "thyme")

nuts_vec <- c("almond", "cashew", "chestnut", "hazelnut", "macadamia_nut", "peanut", "pecan", "pine_nut", "pistachio", "tree_nut", "walnut")

cereals_vec <- c("barley", "bran", "bulgur", "grains", "granola", "oat", "quinoa", "rye", "whole_wheat")

carbs_vec <- c("brown_rice", "chickpea", "cornmeal", "couscous", "hominy_cornmeal_masa", "orzo", "pasta", "potato", "rice", "semolina", "sweet_potato_yam", "wild_rice")

fruits_vec <- c("apple", "apricot", "asian_pear", "avocado", "banana", "berry", "blackberry", "blueberry", "cantaloupe", "cherry", "citrus", "coconut", "cranberry", "currant", "dried_fruit", "fig", "grape", "grapefruit", "guava", "honeydew", "kiwi", "kumquat", "lemon", "lime", "lingonberry", "lychee", "mango", "melon", "nectarine", "olive", "orange", "papaya", "passion_fruit", "peach", "pear", "persimmon", "pineapple", "plantain", "plum", "pomegranate", "prune", "quince", "raisin", "raspberry", "rhubarb", "strawberry", "tamarind", "tangerine", "tropical_fruit", "watermelon")

dessert_vec <- c("biscuit", "brownie", "butterscotch_caramel", "cake", "candy", "chocolate", "cobbler_crumble", "cookie", "cookies", "cranberry_sauce", "crepe", "cupcake", "honey", "jam_or_jelly", "maple_syrup", "marshmallow", "muffin","pancake", "pastry", "pie", "smoothie", "sorbet", "souffle_meringue", "waffle")

cheeses_vec <- c("blue_cheese", "brie", "cheddar", "cottage_cheese", "cream_cheese", "feta", "fontina", "goat_cheese", "gouda", "monterey_jack", "mozzarella", "parmesan", "ricotta", "swiss_cheese")

dairy_vec <- c("butter", "buttermilk", "custard", "egg_nog", "ice_cream", "marscarpone", "milk_cream", "sour_cream", "yogurt")

spices_vec <- c("caraway", "cardamom", "chili", "cinnamon", "clove", "cumin", "curry", "hot_pepper", "jalapeno", "marinade", "nutmeg", "paprika", "pepper", "poppy", "saffron", "sesame", "sesame_oil", "soy_sauce", "vanilla", "wasabi")

#top level categories
general_categories <- c("vegetable", "meat", "fish", "seafood", "herb", "nut", "fruit", "drink", "cheese", "dairy", "spice")#used to select the general columns in ingredients_df; we could also use it later for the for loop
  
all_meats <- c(beef_meat_vec, pork_meat_vec, lamb_meat_vec, meat_with_wings_vec, meat_various_vec)

all_fish_seafood <- c(fish_vec, seafood_vec)

all_ingredients <- c(vegetables_vec, all_meats, all_fish_seafood, herbs_vec, nuts_vec, cereals_vec, carbs_vec, fruits_vec, drink_no_alcohol_vec, dessert_vec, cheeses_vec, dairy_vec, spices_vec, "egg")

#stuff which isn't ingredients and that we need to sort
meals <- c("backyard_bbq", "bread", "breadcrumbs", "brine", "burrito", "casserole_gratin", "coffee", "flat_bread", "hummus", "iced_coffee", "lasagna", "macaroni_and_cheese", "mayonnaise", "mustard", "noodle", "oatmeal", "omelet",  "peanut_butter", "pizza", "pot_pie", "potato_salad", "quiche", "rose", "salad_dressing", "salsa", "sandwich", "sauce", "seed", "soup_stew", "stew", "stock", "stuffing_dressing", "taco", "tart", "tofu", "tortillas", "vinegar", "frittata", "molasses", "sourdough", "fritter", "phyllo_puff_pastry_dough", "dip")
#whole list of stuff to remove for now to be able to sort 
to_remove_temp <- c(us_states, us_cities, countries, alcohol, others, chef, interesting, seasons_vec, celebrations, tools, cooking_techniques, nutritional_values, recipe_type_vec, diet_preferences_vec, all_ingredients, meals)

#select() didn't work here because some columns in the vector don't exist in the dataset; base subsetting (or select() with any_of()) tolerates missing names
recipes_to_filter <- recipes[, !(colnames(recipes) %in% to_remove_temp)]
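An alternative that does work with select() is the any_of() helper, which silently skips names missing from the data. A toy sketch (column names invented):

```r
library(dplyr)

# select() errors on names absent from the data, while any_of() skips them.
toy <- tibble(a = 1, b = 2)
drop_cols <- c("b", "not_a_column")

# select(toy, -all_of(drop_cols))  # would error: "not_a_column" doesn't exist
kept <- select(toy, -any_of(drop_cols))
names(kept)
#> [1] "a"
```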

#creates a tibble with one column with all the colnames to be able to sort ingredients
names <- recipes_to_filter %>% colnames() %>% as_tibble()

# recipes %>% 
#   select(iceland)
  # filter(x_cakeweek==1)

#checked whether some values in our categories weren't column names in recipes --> 19 of them weren't, so we deleted them from the vectors
checking <- tibble(to_remove_temp[!to_remove_temp %in% colnames(recipes)])
checking


#checking the ingredients specifically, because one entry in the ingredient vector seems not to be a column name
checking_ing <- tibble(all_ingredients[!all_ingredients %in% colnames(recipes)])
checking_ing

#no name comes out, so the only logical conclusion is that there is a duplicated ingredient name --> it was arugula
all_ingredients[duplicated(all_ingredients)]
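The two sanity checks used above (membership in the column names, and duplication within a vector) can be illustrated on toy vectors:

```r
# Minimal base-R illustration of the two checks, on invented toy vectors.
cols <- c("apple", "bean", "kale")       # stand-in for colnames(recipes)
vec  <- c("apple", "arugula", "arugula") # stand-in for a category vector

vec[!vec %in% cols]   # entries with no matching column
#> [1] "arugula" "arugula"
vec[duplicated(vec)]  # duplicated entries inflating the count
#> [1] "arugula"
unique(vec)           # de-duplicated vector to use going forward
#> [1] "apple"   "arugula"
```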

2 Data Understanding

2.1 Structure and summary

# Now let's see the structure of our data
recipes %>% 
  head(20) %>% 
  str() 
#> 'data.frame':    20 obs. of  681 variables:
#>  $ ID                      : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ title                   : chr  "Lentil, Apple, and Turkey Wrap "..
#>  $ rating                  : num  2.5 4.38 3.75 5 3.12 ...
#>  $ calories                : num  426 403 165 NA 547 948 NA NA 170 ..
#>  $ protein                 : num  30 18 6 NA 20 19 NA NA 7 23 ...
#>  $ fat                     : num  7 23 7 NA 32 79 NA NA 10 41 ...
#>  $ sodium                  : num  559 1439 165 NA 452 ...
#>  $ x_cakeweek              : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ x_wasteless             : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ x22_minute_meals        : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ x3_ingredient_recipes   : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ x30_days_of_groceries   : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ advance_prep_required   : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ alabama                 : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ alaska                  : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ alcoholic               : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ almond                  : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ amaretto                : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ anchovy                 : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ anise                   : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ anniversary             : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ anthony_bourdain        : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ aperitif                : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ appetizer               : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ apple                   : num  1 0 0 0 0 0 0 0 0 0 ...
#>  $ apple_juice             : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ apricot                 : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ arizona                 : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ artichoke               : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ arugula                 : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ asian_pear              : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ asparagus               : num  0 0 0 0 0 0 1 0 0 0 ...
#>  $ aspen                   : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ atlanta                 : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ australia               : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ avocado                 : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ back_to_school          : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ backyard_bbq            : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ bacon                   : num  0 0 0 0 0 1 0 0 0 0 ...
#>  $ bake                    : num  0 1 0 0 1 0 0 0 0 0 ...
#>  $ banana                  : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ barley                  : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ basil                   : num  0 0 0 0 0 1 0 0 0 0 ...
#>  $ bass                    : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ bastille_day            : num  0 1 0 0 0 0 0 0 0 0 ...
#>  $ bean                    : num  1 0 0 0 0 0 0 0 0 0 ...
#>  $ beef                    : num  0 0 0 0 0 0 0 0 1 0 ...
#>  $ beef_rib                : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ beef_shank              : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ beef_tenderloin         : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ beer                    : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ beet                    : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ bell_pepper             : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ berry                   : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ beverly_hills           : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ birthday                : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ biscuit                 : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ bitters                 : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ blackberry              : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ blender                 : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ blue_cheese             : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ blueberry               : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ boil                    : num  0 0 0 0 0 0 1 0 0 0 ...
#>  $ bok_choy                : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ bon_appetit             : num  0 1 0 1 1 1 1 0 0 0 ...
#>  $ bon_app_tit             : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ boston                  : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ bourbon                 : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ braise                  : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ bran                    : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ brandy                  : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ bread                   : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ breadcrumbs             : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ breakfast               : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ brie                    : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ brine                   : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ brisket                 : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ broccoli                : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ broccoli_rabe           : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ broil                   : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ brooklyn                : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ brown_rice              : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ brownie                 : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ brunch                  : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ brussel_sprout          : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ buffalo                 : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ buffet                  : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ bulgaria                : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ bulgur                  : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ burrito                 : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ butter                  : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ buttermilk              : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ butternut_squash        : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ butterscotch_caramel    : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ cabbage                 : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ cake                    : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ california              : num  0 0 0 0 1 0 0 0 0 0 ...
#>  $ calvados                : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ cambridge               : num  0 0 0 0 0 0 0 0 0 0 ...
#>   [list output truncated]
# Here all variables are stored as numeric, but only "calories", "protein", "fat" and "sodium" (plus "rating", which we treat as categorical) carry truly numerical information. The other variables should be considered categorical since they only take the values 0 or 1.
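One way to verify this split programmatically is to test which columns only ever take the values 0 and 1. A minimal sketch on a toy frame standing in for recipes:

```r
# Identify binary (0/1-only) columns; toy values invented.
toy <- data.frame(rating = c(2.5, 4.38), calories = c(426, 403), apple = c(1, 0))

is_binary <- vapply(toy, function(x) all(x %in% c(0, 1)), logical(1))
names(toy)[is_binary]
#> [1] "apple"
```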


# Let's have a different look at the data with the summary function.
recipes %>% 
  select(rating, calories, protein, fat, sodium) %>% 
  dfSummary(style = "grid")
#> Data Frame Summary  
#> recipes  
#> Dimensions: 20052 x 5  
#> Duplicates: 5576  
#> 
#> +----+-----------+---------------------------+----------------------+-----------+----------+---------+
#> | No | Variable  | Stats / Values            | Freqs (% of Valid)   | Graph     | Valid    | Missing |
#> +====+===========+===========================+======================+===========+==========+=========+
#> | 1  | rating    | Mean (sd) : 3.7 (1.3)     | 0.00 : 1836 ( 9.2%)  | I         | 20052    | 0       |
#> |    | [numeric] | min < med < max:          | 1.25 :  164 ( 0.8%)  |           | (100.0%) | (0.0%)  |
#> |    |           | 0 < 4.4 < 5               | 1.88!:  124 ( 0.6%)  |           |          |         |
#> |    |           | IQR (CV) : 0.6 (0.4)      | 2.50 :  532 ( 2.7%)  |           |          |         |
#> |    |           |                           | 3.12!: 1489 ( 7.4%)  | I         |          |         |
#> |    |           |                           | 3.75 : 5169 (25.8%)  | IIIII     |          |         |
#> |    |           |                           | 4.38!: 8019 (40.0%)  | IIIIIII   |          |         |
#> |    |           |                           | 5.00 : 2719 (13.6%)  | II        |          |         |
#> |    |           |                           | ! rounded            |           |          |         |
#> +----+-----------+---------------------------+----------------------+-----------+----------+---------+
#> | 2  | calories  | Mean (sd) : 6323 (359046) | 1858 distinct values | :         | 15935    | 4117    |
#> |    | [numeric] | min < med < max:          |                      | :         | (79.5%)  | (20.5%) |
#> |    |           | 0 < 331 < 30111218        |                      | :         |          |         |
#> |    |           | IQR (CV) : 388 (56.8)     |                      | :         |          |         |
#> |    |           |                           |                      | :         |          |         |
#> +----+-----------+---------------------------+----------------------+-----------+----------+---------+
#> | 3  | protein   | Mean (sd) : 100.2 (3840)  | 282 distinct values  | :         | 15890    | 4162    |
#> |    | [numeric] | min < med < max:          |                      | :         | (79.2%)  | (20.8%) |
#> |    |           | 0 < 8 < 236489            |                      | :         |          |         |
#> |    |           | IQR (CV) : 24 (38.3)      |                      | :         |          |         |
#> |    |           |                           |                      | :         |          |         |
#> +----+-----------+---------------------------+----------------------+-----------+----------+---------+
#> | 4  | fat       | Mean (sd) : 346.9 (20456) | 326 distinct values  | :         | 15869    | 4183    |
#> |    | [numeric] | min < med < max:          |                      | :         | (79.1%)  | (20.9%) |
#> |    |           | 0 < 17 < 1722763          |                      | :         |          |         |
#> |    |           | IQR (CV) : 26 (59)        |                      | :         |          |         |
#> |    |           |                           |                      | :         |          |         |
#> +----+-----------+---------------------------+----------------------+-----------+----------+---------+
#> | 5  | sodium    | Mean (sd) : 6226 (333318) | 2434 distinct values | :         | 15933    | 4119    |
#> |    | [numeric] | min < med < max:          |                      | :         | (79.5%)  | (20.5%) |
#> |    |           | 0 < 294 < 27675110        |                      | :         |          |         |
#> |    |           | IQR (CV) : 631 (53.5)     |                      | :         |          |         |
#> |    |           |                           |                      | :         |          |         |
#> +----+-----------+---------------------------+----------------------+-----------+----------+---------+
# We can already see for instance that the majority of the values of the variable "rating" are 4.38 (40% of the total). Moreover, we observe that the variables "calories", "protein", "fat" and "sodium" have roughly 20% of missing values.

3 Data Cleaning

3.1 Analysis of NAs

#plot of missing values for each variable

recipes %>% 
  select(rating, all_of(nutritional_values)) %>% 
  gg_miss_var()+
  labs(title = "Number of NA Values for Rating and the Nutritional Variables")

#we use temp_df whenever we need a temporary data frame for a single analysis that we know we won't reuse later on
temp_df <- recipes %>% 
  select(title, all_of(nutritional_values))


na_obs <- which(rowSums(is.na(temp_df)) > 0)

# subset the original dataframe to only include rows with NA values
df_na <- temp_df[na_obs, ]

# print the result
#df_na


# count the number of NAs for each row
na_count <- rowSums(is.na(df_na))

# count the frequency of NA counts
freq_table_na <- table(na_count)

freq_na <- as.data.frame(freq_table_na) %>% 
  mutate(na_count = as.character(na_count))


freq_na %>% 
  ggplot(aes(x=na_count, y=Freq)) +
  geom_bar(stat="identity") +
  xlab("Number of NAs") +
  ylab("Frequency") +
  labs(title ="Number of NAs in nutritional values per recipe", subtitle = "NAs among variables calories, protein, fat, sodium") +
  coord_flip()

recipes <- recipes %>% 
  drop_na()

Among the recipes with NAs, many have NAs for all 4 nutritional values: 4117 out of 4188, to be precise. Without any other information available, imputing such values would not make sense. Interestingly, no recipe has exactly 3 missing nutritional values at once.

We could try to impute the 29 recipes that have only 1 NA; the same operation on the 42 recipes with 2 NAs would not deliver accurate and satisfying results. However, we believe it is not worth imputing these values at all: the nutritional values per recipe are themselves estimates, so an imputation would amount to an estimate of an estimate. To what extent could that be reliable? Since nutritional values represent crucial information for our analysis, we decide to eliminate the recipes with NAs in them.

Finally, we would still have 15864 recipes without NAs.

3.2 Eliminate recipes with rating equal to zero

rating_count <- table(recipes$rating) %>% 
  as.data.frame() %>% 
  rename(rating = Var1,
    frequency = Freq)

recipes <- recipes %>% 
  filter(rating != 0)

There are 1296 recipes with a rating equal to zero. Some of them might be unpopular, others might be too recent to have received a rating. For the purpose of our analysis, we decide to eliminate these recipes.

We are left with 14568 observations after removing the NAs and the observations with a rating of zero.

3.3 Discard copies of recipes

We want to eliminate recipes that appear multiple times. Sometimes recipes share the same title but have different nutritional values, which indicates that there are various ways to prepare that recipe; we keep those variants and only discard exact copies.

Let’s check for instance Almond Butter Crisps, a recipe which can be found twice in the data set, with ID=1026 and ID=8908.

# recipes %>% 
#   filter(ID == 1026)

unique_recipes <- distinct(recipes, title, rating, protein, sodium, fat, calories, .keep_all = TRUE)

# unique_recipes %>% 
#   filter(ID == 8908)

recipes <- unique_recipes

The data set is now free from useless copies: we discarded 1288 of them in total. We would lose slightly fewer if we first removed the recipes without specific ingredients, which means that some of the duplicate copies do not contain specific ingredients either.

3.4 Removing recipes without specific ingredients listed

Here we face a challenge regarding the general ingredient categories. As long as we compute on the binary columns directly, there is no issue: whether a recipe has specific ingredients of a category or only a 1 in the general category, that information is captured in the corresponding binary column.

However, if we want to compute the total number of ingredients of each category present in a recipe, problems arise. To illustrate, assume a recipe contains 3 vegetables (specific columns in vegetables_vec) and its general vegetable column is also 1: summing up, we would get 4 ingredients when it should be 3.

Another problem concerns recipes for which the only column set to 1 in a category is the general one (e.g., vegetable), with no specific ingredient of that category (in vegetables_vec) listed: this too can distort counts of specific ingredients per category.
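Both problems can be seen on a toy example (column names chosen to mirror the dataset; values invented):

```r
library(dplyr)

# Toy rows: two specific vegetable columns plus the general "vegetable"
# flag. Row 1 has two specific vegetables; row 2 has only the flag.
toy <- tibble(
  carrot    = c(1, 0),
  onion     = c(1, 0),
  vegetable = c(1, 1)
)

# Summing across all columns double-counts the general flag (row 1)
# and counts a recipe with no specific vegetable at all (row 2).
unname(rowSums(toy))
#> [1] 3 1

# Summing only the specific columns gives the intended counts.
unname(rowSums(select(toy, carrot, onion)))
#> [1] 2 0
```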

In order to decide whether to analyse with or without the general categories, let us see how many observations would remain if we removed every observation for which a general category is 1 while all specific ingredients in that category are 0.

#this filters out the observations which have 1 for general category and 0s for every ingredient in that category
recipes <- recipes %>%
  filter(!(if_all(all_of(vegetables_vec), ~.x == 0) & vegetable == 1)) %>% 
  filter(!(if_all(all_of(all_meats), ~.x == 0) & meat == 1)) %>% 
  filter(!(if_all(all_of(fish_vec), ~.x == 0) & fish == 1)) %>% 
  filter(!(if_all(all_of(seafood_vec), ~.x == 0) & seafood == 1)) %>% 
  filter(!(if_all(all_of(herbs_vec), ~.x == 0) & herb == 1)) %>% 
  filter(!(if_all(all_of(nuts_vec), ~.x == 0) & nut == 1)) %>% 
  filter(!(if_all(all_of(fruits_vec), ~.x == 0) & fruit == 1)) %>% 
  filter(!(if_all(all_of(drink_no_alcohol_vec), ~.x == 0) & drink == 1)) %>% 
  filter(!(if_all(all_of(cheeses_vec), ~.x == 0) & cheese == 1)) %>% 
  filter(!(if_all(all_of(dairy_vec), ~.x == 0) & dairy == 1)) %>% 
  filter(!(if_all(all_of(spices_vec), ~.x == 0) & spice == 1))

We are left with 10321 observations after removing all recipes that have no specific ingredient in at least one category whose general variable is 1.

4 EDA

4.1 Nutrition EDA

4.1.1 Visual exploration - Univariate Analysis

4.1.1.1 Rating Barplot

recipes %>% 
  ggplot(aes(x=as.factor(rating), fill=as.factor(rating) )) +  
  geom_bar( ) +
  scale_fill_manual(values = c("red4", "red3", "orangered", "orange", "gold", "greenyellow", "green3", "green4") ) +
  theme(legend.position="none") +
  scale_y_continuous(breaks=seq(0,10000,1000)) +
  labs(x = "Rating", y = "Number of recipes", 
       title = "Overview of recipes' ratings")

The ratings in the data are separated into 7 distinct categories spanning from 1.25 to 5. Recall that we decided to exclude recipes with a rating equal to zero. As we can see, most ratings are at or above 3.75; in particular, most recipes have a rating of 4.375.

4.1.1.2 Calories - Boxplot and Histogram

recipes_plot <- recipes %>% 
  pivot_longer(cols = c(calories, protein, fat, sodium),
               names_to = "nutrition",
               values_to = "n_value")

# Calories boxplot not filtered
recipes_plot %>%
  filter(nutrition == "calories") %>% 
  ggplot(aes(x=nutrition, y=n_value, fill=nutrition)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6, option="A") +
    theme_light() +
    theme(legend.position="none",
          plot.title = element_text(size=11)) +
  ggtitle("Boxplot of calories nutritional value") +
  xlab("") +
  ylab("Value")

recipes_plot %>% 
  filter(nutrition == "calories") %>% 
  select(title, nutrition, n_value) %>% 
  arrange(desc(n_value)) 
#> # A tibble: 10,321 x 3
#>    title                                            nutrition n_value
#>    <chr>                                            <chr>       <dbl>
#>  1 "Pear-Cranberry Mincemeat Lattice Pie "          calories   3.01e7
#>  2 "Deep-Dish Wild Blueberry Pie "                  calories   3.00e7
#>  3 "Apricot, Cranberry and Walnut Pie "             calories   1.31e7
#>  4 "Lamb Köfte with Tarator Sauce "                 calories   4.52e6
#>  5 "Rice Pilaf with Lamb, Carrots, and Raisins "    calories   4.16e6
#>  6 "Chocolate-Almond Pie "                          calories   3.36e6
#>  7 "Caramelized Apple and Pear Pie "                calories   3.36e6
#>  8 "Merguez Lamb Patties with Golden Raisin Cousco~ calories   5.45e4
#>  9 "Grilled Lamb Chops with Porcini Mustard "       calories   2.41e4
#> 10 "Braised Short Ribs with Red Wine Gravy "        calories   1.96e4
#> # i 10,311 more rows

We notice that there are recipes with more than 30’000’000 calories, which are clearly outliers. These values are hard to interpret from the boxplot, and even a density plot yields no insight, so we must discard them to continue with a meaningful analysis. There are 28 recipes with more than 7000 calories; we consider these extreme values, which represent 0.27% of the available data. Why do they exist? It could be due to a miscalculation or to an excessive number of servings per recipe. Based on the usual number of calories per recipe, we decided to keep only those with a reasonable quantity, i.e., below 7000.
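As a sanity check on the cutoff, the counts above can be reproduced directly (a quick sketch, assuming the recipes data frame as loaded above; the expected figures come from the paragraph):

```r
# Recipes above the 7000-calorie cutoff and their share of the data
extreme_calories <- recipes %>% 
  filter(calories > 7000)

nrow(extreme_calories)                                 # 28 recipes
round(nrow(extreme_calories) / nrow(recipes) * 100, 2) # ~0.27% of observations
```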

# Calories boxplot
df <- recipes_plot %>%
  filter(nutrition == "calories", n_value <= 7000)

boxplot_calories <- df %>% 
  ggplot(aes(x=nutrition, y=n_value, fill=nutrition)) +
  geom_boxplot()+
  scale_fill_viridis(discrete = TRUE, alpha=0.6) +
  theme(legend.position="none",
        plot.title = element_text(size=11)) +
  scale_y_continuous(breaks=seq(0,7000,500)) +
  labs(x = "", y = "Calories", title = "Boxplot of calories nutritional value filtered")


# Calories histogram
histogram_calories <- df %>% 
  ggplot(aes(x=n_value)) +
  geom_histogram(binwidth=50, fill="red3", color="red3", alpha=0.9) +
  theme(plot.title = element_text(size=11)) +
  scale_x_continuous(breaks=seq(0,10000,1000)) +
  scale_y_continuous(breaks=seq(0,1750,250)) +
  labs(x = "Calories", y = "Count", title = "Distribution of calories across all recipes")


# Calories density plot
density_calories <- df %>% 
  ggplot(aes(x=n_value)) +
  geom_density(fill="red3", color="red2", alpha=0.8) +
  theme(plot.title = element_text(size=11)) +
  scale_x_continuous(breaks=seq(0,10000,1000)) +
  ggtitle("Distribution of calories across all recipes") +
  xlab("Calories")

grid.arrange(boxplot_calories, histogram_calories, density_calories, ncol=2, nrow =2)

After filtering the extreme values, we observe that most recipes have between 200 and 600 calories. The histogram of the calorie distribution confirms that most recipes indeed have fewer than 1000 calories.

4.1.1.3 Protein - Boxplot and Histogram

recipes_plot %>%
  filter(nutrition == "protein") %>% 
  ggplot(aes(x=nutrition, y=n_value, fill=nutrition)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6, option="A") +
    theme_light() +
    theme(legend.position="none",
          plot.title = element_text(size=11)) +
  ggtitle("Boxplot of protein nutritional value") +
  xlab("") +
  ylab("Value")

We notice that there are recipes with more than 50’000 grams of protein, which are clearly outliers. We must discard those values to continue with a meaningful analysis; otherwise, from a visual point of view, we could not extract any relevant information. By checking on the epicurious website the recipes with protein values above 1000, we also verified that the stated amount of protein was not justified. We came to that conclusion by evaluating the average protein content per 100 grams of each ingredient in the specific recipe.

recipes_plot %>%
  filter(nutrition == "protein") %>% 
  select(title, nutrition, n_value) %>% 
  arrange(desc(n_value)) 
#> # A tibble: 10,321 x 3
#>    title                                            nutrition n_value
#>    <chr>                                            <chr>       <dbl>
#>  1 "Rice Pilaf with Lamb, Carrots, and Raisins "    protein    236489
#>  2 "Pear-Cranberry Mincemeat Lattice Pie "          protein    200968
#>  3 "Deep-Dish Wild Blueberry Pie "                  protein    200210
#>  4 "Lamb Köfte with Tarator Sauce "                 protein    166471
#>  5 "Apricot, Cranberry and Walnut Pie "             protein     87188
#>  6 "Chocolate-Almond Pie "                          protein     58334
#>  7 "Caramelized Apple and Pear Pie "                protein     58324
#>  8 "Merguez Lamb Patties with Golden Raisin Cousco~ protein      2074
#>  9 "Manhattan Clam Chowder "                        protein      1625
#> 10 "Clam and Oyster Chowder "                       protein      1365
#> # i 10,311 more rows
# Proteins boxplot
boxplot_protein <- recipes_plot %>%
  filter(nutrition == "protein", n_value <= 1000) %>% 
  ggplot( aes(x=nutrition, y=n_value, fill=nutrition)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6, option="A") +
    theme(legend.position="none",
          plot.title = element_text(size=11)) +
  scale_y_continuous(breaks=seq(0,1000,100)) +
  ggtitle("Boxplot of protein nutritional value filtered") +
  xlab("") +
  ylab("Proteins") 


# Proteins histogram
histogram_protein <- recipes_plot %>% 
  filter(nutrition == "protein", n_value <= 1000) %>% 
  ggplot(aes(x=n_value)) +
  geom_histogram(binwidth=7, fill="red3", color="red3", alpha=0.9) +
  theme(plot.title = element_text(size=15)) +
  scale_x_continuous(breaks=seq(0,1000,50)) +
  scale_y_continuous(breaks=seq(0,7000,250)) +
  ggtitle("Distribution of proteins across all recipes") +
  xlab("Proteins") +
  ylab("Count") 


# Protein density plot
density_protein <- recipes_plot %>% 
  filter(nutrition == "protein", n_value <= 1000) %>% 
  ggplot(aes(x=n_value)) +
  geom_density(fill="red3", color="red2", alpha=0.8) +
  scale_x_continuous(breaks=seq(0,1000,50)) +
  ggtitle("Distribution of proteins across all recipes") +
  xlab("Proteins")

grid.arrange(boxplot_protein, histogram_protein, density_protein, ncol=2, nrow =3)

From the boxplot, we observe that most recipes have less than 30 grams of protein. The histogram verifies this, and we could even extend the range to 100 grams of protein per recipe. We assume that recipes above this threshold contain ingredients like meat, tuna, salmon or shrimp.

4.1.1.4 Sodium - Boxplot and Histogram

recipes_plot %>%
  filter(nutrition == "sodium") %>% 
  ggplot(aes(x=nutrition, y=n_value, fill=nutrition)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6, option="A") +
    theme(legend.position="none",
          plot.title = element_text(size=11)) +
  ggtitle("Boxplot of sodium nutritional value") +
  xlab("") +
  ylab("Value")

We notice that there are recipes with more than 100’000 milligrams of sodium which are clearly outliers. We must then discard those values in order to continue with a meaningful analysis. By conducting further research, we realize that sodium values above 30’000 are highly suspicious.

recipes_plot %>% 
  filter(nutrition == "sodium") %>%
  select(title, nutrition, n_value) %>% 
  arrange(desc(n_value)) 
#> # A tibble: 10,321 x 3
#>    title                                           nutrition  n_value
#>    <chr>                                           <chr>        <dbl>
#>  1 "Pear-Cranberry Mincemeat Lattice Pie "         sodium    27675110
#>  2 "Deep-Dish Wild Blueberry Pie "                 sodium    27570999
#>  3 "Apricot, Cranberry and Walnut Pie "            sodium    12005810
#>  4 "Lamb Köfte with Tarator Sauce "                sodium     7540990
#>  5 "Chocolate-Almond Pie "                         sodium     3449512
#>  6 "Caramelized Apple and Pear Pie "               sodium     3449373
#>  7 "Rice Pilaf with Lamb, Carrots, and Raisins "   sodium     3134853
#>  8 "Whole Branzino Roasted in Salt "               sodium      132220
#>  9 "Red Snapper Baked in Salt with Romesco Sauce " sodium      132025
#> 10 "Scallops with Mushrooms in White-Wine Sauce "  sodium       90572
#> # i 10,311 more rows
# Sodium boxplot
boxplot_sodium <- recipes_plot %>%
  filter(nutrition == "sodium", n_value <= 30000) %>% 
  ggplot(aes(x=nutrition, y=n_value, fill=nutrition)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6, option="A") +
    theme(legend.position="none",
          plot.title = element_text(size=11)) +
  scale_y_continuous(breaks=seq(0,30000,500)) +
  ggtitle("Boxplot of sodium nutritional value") +
  xlab("") +
  ylab("Sodium") 


# Sodium histogram
histogram_sodium <- recipes_plot %>%
  filter(nutrition == "sodium", n_value <= 30000) %>% 
  ggplot(aes(x=n_value)) +
  geom_histogram(binwidth=50, fill="red3", color="red3", alpha=0.9) +
  theme(plot.title = element_text(size=15)) +
  scale_x_continuous(breaks=seq(0,30000,1000)) +
  scale_y_continuous(breaks=seq(0,1750,250)) +
  ggtitle("Distribution of sodium across all recipes") +
  xlab("Sodium") +
  ylab("Count") 


# Sodium density plot
density_sodium <- recipes_plot %>%
  filter(nutrition == "sodium", n_value <= 30000) %>% 
  ggplot(aes(x=n_value)) +
  geom_density(fill="red3", color="red2", alpha=0.8) +
  scale_x_continuous(breaks=seq(0,30000,1000)) +
  ggtitle("Distribution of sodium across all recipes") +
  xlab("Sodium") 


grid.arrange(boxplot_sodium, histogram_sodium, density_sodium, ncol=1, nrow =3)

From the boxplot we observe that most recipes have sodium values below 750 milligrams. The histogram confirms that most recipes indeed have less than 750 milligrams of sodium, although a fair number of recipes lie between 750 and 2000 milligrams.

4.1.1.5 Fat - Boxplot and Histogram

recipes_plot %>%
  filter(nutrition == "fat") %>% 
  ggplot(aes(x=nutrition, y=n_value, fill=nutrition)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6, option="A") +
    theme_light() +
    theme(legend.position="none",
          plot.title = element_text(size=11)) +
  ggtitle("Boxplot of fat nutritional value") +
  xlab("") +
  ylab("Value")

We notice that there are recipes with more than 44’000 grams of fat, which are clearly outliers. We must discard those values to continue with a meaningful analysis. By checking on the epicurious website the recipes with fat values above 1000, we also verified that the stated amount of fat was not justified. We came to that conclusion by evaluating the average fat content per 100 grams of each ingredient in the specific recipe.

recipes_plot %>% 
  filter(nutrition == "fat") %>% 
  select(title, nutrition, n_value) %>% 
  arrange(desc(n_value)) 
#> # A tibble: 10,321 x 3
#>    title                                           nutrition n_value
#>    <chr>                                           <chr>       <dbl>
#>  1 "Pear-Cranberry Mincemeat Lattice Pie "         fat       1722763
#>  2 "Deep-Dish Wild Blueberry Pie "                 fat       1716279
#>  3 "Apricot, Cranberry and Walnut Pie "            fat        747374
#>  4 "Rice Pilaf with Lamb, Carrots, and Raisins "   fat        221495
#>  5 "Chocolate-Almond Pie "                         fat        186660
#>  6 "Caramelized Apple and Pear Pie "               fat        186642
#>  7 "Lamb Köfte with Tarator Sauce "                fat         44198
#>  8 "Grilled Lamb Chops with Porcini Mustard "      fat          2228
#>  9 "Braised Short Ribs with Red Wine Gravy "       fat          1818
#> 10 "Braised Duck Legs with Shallots and Parsnips " fat          1610
#> # i 10,311 more rows
# Fat boxplot
boxplot_fat <- recipes_plot %>%
  filter(nutrition == "fat", n_value <= 40000) %>% 
  ggplot( aes(x=nutrition, y=n_value, fill=nutrition)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6, option="A") +
    theme_light() +
    theme(legend.position="none",
          plot.title = element_text(size=11)) +
  scale_y_continuous(breaks=seq(0,3000,100)) +
  ggtitle("Boxplot of fat nutritional value filtered") +
  xlab("") +
  ylab("Fat") 


# Fat histogram
histogram_fat <- recipes_plot %>% 
  filter(nutrition == "fat", n_value <= 40000) %>% 
  ggplot(aes(x=n_value)) +
  geom_histogram(binwidth=7, fill="red3", color="red3", alpha=0.9) +
  theme(plot.title = element_text(size=15)) +
  ggtitle("Distribution of fat across all recipes") +
  scale_x_continuous(breaks=seq(0,3000,100)) +
  scale_y_continuous(breaks=seq(0,7000,250)) +
  xlab("Fat") +
  ylab("Count") 

# Fat density plot
density_fat <- recipes_plot %>%
  filter(nutrition == "fat", n_value <= 40000) %>% 
  ggplot(aes(x=n_value)) +
  geom_density(fill="red3", color="red2", alpha=0.8) +
  scale_x_continuous(breaks=seq(0,3000,100)) +
  ggtitle("Distribution of fat across all recipes") +
  xlab("Fat") 

grid.arrange(boxplot_fat, histogram_fat, density_fat, ncol=3)

The boxplot is hard to interpret. Certain recipes could plausibly have more than 1000 or even 2000 grams of fat because of a high number of servings and the use of ingredients such as lamb, duck and bacon. The histogram gives a better overview: most recipes have fat values below 100 grams.

4.1.2 Visual exploration - Multivariate Analysis

4.1.2.1 Scatterplots of Rating-Calories

#removing 41 outliers we discovered above from the recipes df
recipes <- recipes %>% 
  filter(calories <= 7000, protein <= 1000, sodium <= 30000, fat <= 40000)


# Scatterplot of Rating-Calories
sp1 <- recipes %>% 
  ggplot(aes(x=calories, y=rating)) +
  geom_point(alpha=.5) +
  ggtitle("Scatterplot of rating against calories") +
  xlab("Calories") +
  ylab("Rating")

# Scatterplot of Rating-Protein
sp2 <- recipes %>% 
  ggplot(aes(x=protein, y=rating)) +
  geom_point(alpha=.5) +
  ggtitle("Scatterplot of rating against proteins") +
  xlab("Proteins") +
  ylab("Rating") 

# Scatterplot of Rating-Fat
sp3 <- recipes %>% 
  ggplot(aes(x=fat, y=rating)) +
  geom_point(alpha=.5) +
  ggtitle("Scatterplot of rating against fat") +
  xlab("Fat") +
  ylab("Rating") 

# Scatterplot of Rating-Sodium
sp4 <- recipes %>% 
  ggplot(aes(x=sodium, y=rating)) +
  geom_point(alpha=.5) +
  ggtitle("Scatterplot of rating against sodium") +
  xlab("Sodium") +
  ylab("Rating") 

grid.arrange(sp1, sp2, sp3, sp4, ncol=2, nrow =2)

We can observe that recipes with more than 2000 calories tend to have a higher rating: indeed, few recipes with less than a 3-star rating have more than 2000 calories.

The same holds for proteins: few recipes with less than a 3-star rating have more than 125 grams of protein.

Likewise for fat: few recipes with less than a 3-star rating have more than 100 grams of fat.

And for sodium: few recipes with less than a 3-star rating have more than 5000 mg of sodium.

4.1.2.2 Correlogram

corr_nutritional_values = recipes %>% 
  select(rating, calories, protein, fat, sodium) %>% 
  cor()

corrplot(corr_nutritional_values)

The previous scatterplots misled us into thinking there was some correlation between rating and the nutritional values. This hypothesis is refuted: the correlation with rating is almost zero for all nutritional values. On the other hand, we notice a strong positive correlation between calories and fat, as well as between calories and protein.

4.1.2.3 Grouped Scatter

We decide to plot together the variables that show a high level of correlation.

# Grouped scatter of calories and fat
recipes_plot1 <- recipes %>% 
  filter(fat <= 400, calories <= 6000) %>% 
  ggplot(aes(x=calories, y=fat, color=rating)) +
  geom_point(alpha=.5) + 
  scale_color_gradientn(colours = rainbow(5))

# Grouped scatter of calories and protein
recipes_plot2 <- recipes %>% 
  filter(protein <= 500, calories <= 6000) %>% 
  ggplot(aes(x=calories, y=protein, color=rating)) +
  geom_point(alpha=.5) + 
  scale_color_gradientn(colours = rainbow(5))

# Grouped scatter of protein and fat
recipes_plot3 <- recipes %>% 
  filter(fat <= 400, protein <= 350) %>% 
  ggplot(aes(x=protein, y=fat, color=rating)) +
  geom_point(alpha=.5) + 
  scale_color_gradientn(colours = rainbow(5))


# Grouped scatter of protein and sodium
recipes_plot4 <- recipes %>% 
  filter(sodium <= 400, protein <= 350) %>%  
  ggplot(aes(x=protein, y=sodium, color=rating)) +
  geom_point(alpha=.5) + 
  scale_color_gradientn(colours = rainbow(5))

grid.arrange(recipes_plot1, recipes_plot2, recipes_plot3, recipes_plot4, ncol=2, nrow =2)

We notice a positive correlation between fat and calories as well as between protein and calories, and a slightly positive correlation between fat and protein. We tried to understand to what extent the rating could affect these relationships, but ratings above 3 are so overwhelmingly frequent that they strongly determine the behaviour of these relationships.

4.2 Ingredients EDA

4.2.1 Feature engineering

We discovered that the variable “drinks” on its own had only 11 observations, 4 of which also had “drink” equal to 1. We therefore merged the two columns into a single category called “drink” covering all drinks.
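The overlap can be checked directly before merging (a quick sketch on the raw columns; per the counts above, 11 recipes have drinks == 1, of which 4 also have drink == 1):

```r
# How do the 11 recipes flagged as "drinks" split on the "drink" flag?
recipes %>% 
  filter(drinks == 1) %>% 
  count(drink)  # expected: 7 recipes with drink == 0, 4 with drink == 1
```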

#Creating a new dataframe with only the ID, title and the ingredients
ingredients_df <- recipes %>% 
  mutate(drink = ifelse(drink == 1 | drinks == 1, 1, 0)) %>% #merging drinks and drink
  select(ID, title, all_of(all_ingredients), rating)

4.2.1.1 Creating binary ingredients categories

ingredients_df_bin <- ingredients_df %>%
  mutate(vegetables_bin = as.numeric(if_any(all_of(vegetables_vec), ~.x == 1, na.rm = TRUE)),
         meats_bin = as.numeric(if_any(all_of(all_meats), ~.x == 1, na.rm = TRUE)),
         fish_bin = as.numeric(if_any(all_of(fish_vec), ~.x == 1, na.rm = TRUE)),
         seafood_bin = as.numeric(if_any(all_of(seafood_vec), ~.x == 1, na.rm = TRUE)),
         herbs_bin = as.numeric(if_any(all_of(herbs_vec), ~.x == 1, na.rm = TRUE)),
         nuts_bin = as.numeric(if_any(all_of(nuts_vec), ~.x == 1, na.rm = TRUE)),
         fruits_bin = as.numeric(if_any(all_of(fruits_vec), ~.x == 1, na.rm = TRUE)),
         cheese_bin = as.numeric(if_any(all_of(cheeses_vec), ~.x == 1, na.rm = TRUE)),
         dairy_bin = as.numeric(if_any(all_of(dairy_vec), ~.x == 1, na.rm = TRUE)),
         spices_bin = as.numeric(if_any(all_of(spices_vec), ~.x == 1, na.rm = TRUE)),
         cereals_bin = as.numeric(if_any(all_of(cereals_vec), ~.x == 1, na.rm = TRUE)),
         carbs_bin = as.numeric(if_any(all_of(carbs_vec), ~.x == 1, na.rm = TRUE)),
         dessert_bin = as.numeric(if_any(all_of(dessert_vec), ~.x == 1, na.rm = TRUE)),
         egg_bin = (egg)
         ) %>% 
  select(ID, title, contains("bin"), everything())

The fact that both selections return the same number of rows makes the general categories redundant in the dataset. They are not needed to create the binary columns, nor to compute the total number of ingredients per category per recipe, so we simply do not include them in the first place.

####testing if I still need to include the general category to create the binary column now that I modified the df to only include recipes with ingredients specified
# 
# #6586
# ingredients_df %>%
#   mutate(vegetables_bin = as.factor(as.numeric(if_any(c(vegetable, all_of(vegetables_vec)), ~.x == 1, na.rm = TRUE)))) %>%
#   filter(vegetables_bin == 1)
# 
# #6586
# ingredients_df %>%
#   mutate(vegetables_bin = as.factor(as.numeric(if_any(all_of(vegetables_vec), ~.x == 1, na.rm = TRUE)))) %>%
#   filter(vegetables_bin == 1)

4.2.1.2 Creating total ingredients categories

ingredients_df_total <- ingredients_df %>%
  mutate(total_ingredients = rowSums(select(., c(all_of(all_ingredients)))),
         total_vegetables = rowSums(select(., c(all_of(vegetables_vec)))),
         total_meat = rowSums(select(., c(all_of(all_meats)))),
         total_fish = rowSums(select(., c(all_of(fish_vec)))),
         total_seafood = rowSums(select(., c(all_of(seafood_vec)))),
         total_herbs = rowSums(select(., c(all_of(herbs_vec)))),
         total_nuts = rowSums(select(., c(all_of(nuts_vec)))),
         total_fruits = rowSums(select(., c(all_of(fruits_vec)))),
         total_cheese = rowSums(select(., c(all_of(cheeses_vec)))),
         total_dairy= rowSums(select(., c(all_of(dairy_vec)))),
         total_spices= rowSums(select(., c(all_of(spices_vec)))),
         total_cereals= rowSums(select(., c(all_of(cereals_vec)))),
         total_carbs = rowSums(select(., c(all_of(carbs_vec)))),
         total_dessert = rowSums(select(., c(all_of(dessert_vec))))
         ) %>% 
  select(ID, title, contains("total"), everything())

4.2.1.3 Creating ingredients_df_full

Creating “ingredients_df_full” which contains bin columns, total columns, and original ingredients columns

total_join <- ingredients_df_total %>% 
  select(ID, contains("total"))
  
ingredients_df_full <- ingredients_df_bin %>% 
  left_join(total_join) %>% 
  select(ID, title, rating, contains("bin"), contains("total"), everything())

4.2.2 “Binary” engineered ingredients categories

4.2.2.1 Frequency of ingredients - binary categories

This gives us interesting information about how often each category appears at least once in a recipe. As we can see, around 6750 of the 11380 recipes contain at least one vegetable. Conversely, very few recipes contain at least one type of cereal.

#creating a vector with colnames of all the binary columns to be able to select them more easily afterwards
binary_columns <- ingredients_df_bin %>% 
  select(contains("bin")) %>% 
  colnames()

#adding binary columns to ingredients_df
total_categories <- ingredients_df_bin %>% 
  select(ID, all_of(binary_columns)) %>% 
  pivot_longer(-ID, names_to = "category", values_to = "binary_value") %>% 
  group_by(category) %>% 
  summarise(total = sum(binary_value))

#plotting the frequency of binary columns
total_categories %>%
  ggplot(aes(x=reorder(category,total), y=total, fill=total)) +
  geom_bar(stat = "identity") +
  scale_x_discrete(guide = guide_axis(n.dodge=3))+
  scale_fill_viridis() +
  labs(x = "Category", y = "Amount of recipes", title = "Total amount of recipes containing at least one ingredient in defined categories")

Most recipes contain vegetables, fruits, meats, herbs and carbs, which is no surprise. The barplots below give similar information about the number of recipes containing at least one ingredient from each category. Vegetables is the only category whose ingredients appear at least once in more than 50% of the recipes.

ingredients_df_bin %>% 
  select(contains("bin")) %>%
  mutate(across(everything(), as.factor)) %>% 
  plot_bar(order_bar = FALSE)

4.2.2.2 Relationship between binary ingredients categories and 7-level rating variable

4.2.2.2.1 Barplots

We now want to check the relationship between our rating variable and all the binary ingredient variables. We first plot these relationships with multilevel barplots.

ingredients_df_bin %>% 
  select(contains("bin"), rating) %>%
  mutate(across(everything(), as.factor)) %>% 
  plot_bar(by = "rating", order_bar = FALSE)

4.2.2.2.2 Correlation plot

We see a somewhat strong negative correlation between vegetables and dessert, and between vegetables and fruits. This makes sense, as these ingredients are rarely found together in recipes. As a side note, we chose to classify tomato as a vegetable and strongly stand by this opinion :)

Concerning positive correlations, nuts and dessert are highly correlated, probably because they go well together in sugary recipes. Additionally, egg and dairy are slightly positively correlated, most likely because of patisserie recipes where eggs and dairy ingredients go hand in hand.

When looking at the correlation of the binary ingredient variables with the rating, we see that the strongest negative correlation is between rating and carbs_bin, which may sound a bit surprising given the assumed popularity of pasta, for example. The strongest positive correlation is with fruits_bin, at 0.04. We note, however, that all correlations between the binary ingredient variables and rating are disappointingly weak.

#corr plot
ingredients_df_bin %>% 
  select(contains("bin"), rating) %>%
  plot_correlation()

# # With rating as a factor
# ingredients_df_bin %>% 
#   select(contains("bin"), rating) %>% 
#   mutate(across(rating, as.factor)) %>% 
#   plot_correlation()
# 
# #From this graph we also notice that recipes with rating at 3.75 have slightly positive relationship with carbs and vegetables, whereas recipes with rating at 5 have slightly negative relationship with the same features. We assume then that recipes containing vegetables and carbs tend to be less appreciated.

4.2.2.3 Relationship between binary ingredients categories and binary rating variable

Given the low correlation between the binary ingredient categories and the 7-level rating, we decided to investigate whether we could find a higher correlation by transforming the variable into a binary rating. We set the threshold between a “bad” and a “good” rating at 4.
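Before plotting, we can look at how the recipes split under this threshold (a quick sketch; the exact counts depend on the filtered data):

```r
# Number of recipes on each side of the rating threshold of 4
ingredients_df_bin %>% 
  mutate(rating_bin = ifelse(rating > 4, "good", "bad")) %>% 
  count(rating_bin)
```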

ingredients_df_bin %>% 
  select(contains("bin"), rating) %>%
  mutate(rating_bin = ifelse(rating>4, "good", "bad"), across(everything(), as.factor)) %>% 
  select(-rating) %>% 
  plot_bar(by = "rating_bin", order_bar = FALSE)

We now have only 2 categories: recipes with ratings above 4 and recipes with ratings below 4. There is no clear relationship in these graphs either, which confirms the correlation results found above for the 7-level rating variable.

If we look at vegetables for example, we can see that the proportion of recipes with ratings above 4 is higher for recipes containing no vegetables, when compared to recipes containing at least one vegetable.

We once again plot the correlation, but this time with the binary rating variable. The results are very similar to the 7-level rating variable, with very weak correlations.

#corr plot
ingredients_df_bin %>% 
  select(contains("bin"), rating) %>%
  mutate(rating_bin = ifelse(rating>4, 1, 0)) %>% 
  select(-rating) %>% 
  plot_correlation()

4.2.3 “Total” engineered ingredients variables

#Analysis which single ingredient is present in most recipes
df <- ingredients_df %>% 
  select(-title, -rating) %>% 
  pivot_longer(-ID, names_to = "ingredient", values_to = "value")


ing_top10 <- df %>% 
  group_by(ingredient) %>% 
  summarise(total = sum(value)) %>% 
  ungroup() %>% 
  arrange(desc(total)) %>% 
  dplyr::slice(1:10)

ing_top10 %>% 
  # mutate(ingredient = fct_rev(ingredient)) %>% 
  ggplot(aes(x=reorder(ingredient, total), y=total, fill=ingredient)) +
  geom_bar(stat = "identity") +
  scale_fill_viridis(discrete = TRUE) +
  scale_x_discrete(guide = guide_axis(n.dodge=2))+
  labs(x = "Ingredient", y = "Value", title = "Total amount of recipes containing each ingredient\nTop 10")

As we can observe, many recipes contain milk cream, onion, tomato, salad, egg and garlic. Some of these are versatile as they can be used for many different kinds of recipes. Onion and garlic are widely used for giving flavour to many dishes, whereas egg and milk cream can be used to cook salty and sweet recipes.

4.2.3.1 Correlation between total number of ingredients per category and 7-level rating variable

ingredients_df_total %>% 
  select(contains("total"), rating) %>% 
  plot_correlation()

We notice a negative relationship between the number of fruits and the number of vegetables, which makes sense as these two kinds of ingredients are rarely combined in a recipe. The same is true for the number of vegetables relative to the number of desserts. The results are similar to the previous correlogram with the binary columns. In this case we do not notice any significant relationship between rating and the other variables.

4.2.3.2 Correlation between total number of ingredients per category and binary rating variable

Results are disappointing again and in line with what we have found so far: correlations between the total number of ingredients per category and the binary rating variable are also very weak.

#corr plot
ingredients_df_total %>% 
  select(contains("total"), rating) %>%
  mutate(rating_bin = ifelse(rating>4, 1, 0)) %>% 
  select(-rating) %>% 
  plot_correlation()

4.2.3.3 Amount of ingredients per recipe

The number of ingredients per recipe is more or less normally distributed, with a mean around 4.75.

#checking some stuff about the new ingredients table
ingredients_df_total %>% 
  select(ID, title, total_ingredients) %>% 
  ggplot(aes(x=total_ingredients)) + 
  geom_bar() + geom_vline(aes(xintercept=mean(total_ingredients)),color="red", linetype="dashed", size=1)+
   scale_x_continuous(breaks = seq(1, 12, by = 1)) +
  labs(x="Number of ingredients per recipe", y = "Recipe Count", title = "Distribution of number of ingredients per recipe")

#we notice that 117 (no NAs and RAT0, and not duplicated) recipes have 0 ingredients, let's investigate why and how that's possible
ingredients_df_total %>% 
  filter(total_ingredients==0)

#let's pick recipe ID number 1183 which should have poppy and sesame seeds according to the title
recipes %>% 
  filter(ID == 1183) %>% 
  select_if(~ any(. == 1))
#we can see that only 3 variables are equal to 1 here
recipes %>% 
  filter(ID == 365) %>% 
  select_if(~ any(. == 1))

recipes %>% 
  filter(ID == 1089) %>% 
  select_if(~ any(. == 1))

#####
#QUESTION: do we want to keep those recipes?
#####


ingredients_df_total %>% 
  filter(total_ingredients >10)

Based on this information, we decide to eliminate those 117 observations which don’t contain any ingredients that we have classified in our vectors.

#eliminating all recipes which do not contain any ingredients that we have classified in categories --> within the "all_ingredients" vector
index_zero_ingredient <- ingredients_df_full %>% 
  filter(total_ingredients == 0) %>% pull(ID)

#removing the 117 recipes with no ingredients listed from recipes and ingredients_df_full
recipes <- recipes %>% 
  filter(!ID %in% index_zero_ingredient)

ingredients_df_full <- ingredients_df_full %>%
  filter(!total_ingredients == 0)

Besides the total number of ingredients, let's check the number of ingredients per recipe for the top three categories in terms of ingredient frequency (i.e., vegetables, fruits, and meats).

show_nveg <- ingredients_df_full %>%
  filter(vegetables_bin == 1) %>% 
  select(ID, title, total_vegetables) %>% 
  ggplot(aes(x=total_vegetables)) + 
  scale_x_continuous(breaks = seq(1, 9, by = 1)) +
  geom_bar() + geom_vline(aes(xintercept=mean(total_vegetables)),color="blue", linetype="dashed", size=1)


show_nfruit <- ingredients_df_full %>%
  filter(fruits_bin == 1) %>% 
  select(ID, title, total_fruits) %>% 
  ggplot(aes(x=total_fruits)) + 
  scale_x_continuous(breaks = seq(1, 9, by = 1)) +
  geom_bar() + geom_vline(aes(xintercept=mean(total_fruits)),color="blue", linetype="dashed", size=1)



#distribution of the number of meats, restricted to recipes which contain meat
show_nmeat <- ingredients_df_full %>% 
  filter(meats_bin == 1) %>% 
  select(ID, title, total_meat) %>% 
  ggplot(aes(x=total_meat)) + 
  geom_bar() + geom_vline(aes(xintercept=mean(total_meat)),color="blue", linetype="dashed", size=1)+
  labs(x="Number of meats per recipe", y = "Recipe Count", title = "Distribution of number of meats per recipe")

#note: counts at 0 meats appeared earlier because the general meat category was initially left out when building the total_meat column

grid.arrange(show_nveg, show_nfruit, show_nmeat, ncol=2, nrow =2)

As we can observe, most recipes contain one or two vegetables and rarely more than four. The same holds for fruits. As for meat, one kind per recipe is common, but three or more in the same recipe are rare.
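These proportions can also be checked numerically rather than read off the bars; a minimal sketch, assuming the ingredients_df_full columns used above:

#share of vegetable-containing recipes with at most two vegetables, and with more than four
ingredients_df_full %>%
  filter(vegetables_bin == 1) %>%
  summarise(share_le_2 = mean(total_vegetables <= 2),
            share_gt_4 = mean(total_vegetables > 4))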

4.3 Mixed EDA - ingredients and nutritional value

recipes_select <- recipes %>% 
  select(ID, title, rating, calories, protein, sodium, fat)

ingredients_select <- ingredients_df_total %>% 
  select(ID, contains("total"))

recipes_more <- recipes_select %>%
  left_join(ingredients_select, 
           by=c('ID'))

ingredients_bin_select <- ingredients_df_bin %>% 
  select(ID, contains("bin"))


recipes_full <- recipes_more %>%
  left_join(ingredients_bin_select, 
           by=c('ID'))

4.3.1 Correlation between nutritional values and engineered ingredients variables

recipes_full %>% 
  select(-ID) %>% 
  plot_correlation()

For instance, we notice a positive correlation between protein and meats_bin, which groups all sorts of meat. Another visible positive correlation is between sodium and seafood_bin. We might also want to investigate the relationship between calories and carbs_bin.
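The three correlations highlighted above can be read off numerically as well; a quick sketch on recipes_full, using complete observations only:

#pairwise correlations between nutritional values and the binary category flags
recipes_full %>%
  summarise(cor_protein_meat   = cor(protein,  meats_bin,   use = "complete.obs"),
            cor_sodium_seafood = cor(sodium,   seafood_bin, use = "complete.obs"),
            cor_calories_carbs = cor(calories, carbs_bin,   use = "complete.obs"))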

4.3.2 Barplot and boxplot - Meat and Proteins

# Barplot
barplot1 <- recipes_full %>% 
  ggplot(aes(x = factor(meats_bin), y = protein)) +
  stat_summary(fun = mean, geom = "bar") +
  ggtitle("Average amount of proteins per recipe with and without meat") +
  xlab("Presence of Meat or not") +
  ylab("Protein Content in grams")

# Boxplots per different kinds of meat
recipes_general <- recipes_full %>% 
  select(ID) %>% 
  left_join(recipes, 
           by=c('ID'))

recipes_meat <- recipes_general %>% 
  select(ID, title, rating, calories, protein, fat, sodium, all_of(all_meats))

recipes_meat <- recipes_meat %>% 
  pivot_longer(cols=c("beef", "beef_rib", "beef_shank", "beef_tenderloin", "brisket", "ground_beef", "hamburger", "veal", "bacon", "ham", "pork", "pork_chop", "pork_rib", "pork_tenderloin", "prosciutto", "ground_lamb", "lamb", "lamb_chop", "lamb_shank", "rack_of_lamb", "chicken", "duck", "goose", "poultry", "poultry_sausage", "quail", "turkey", "meatball", "meatloaf", "rabbit", "sausage", "steak", "venison" ),
                    names_to='meats',
                    values_to='yes_or_no') %>% 
  filter(yes_or_no == 1)

multi_boxplot1 <- recipes_meat %>% 
  filter(protein < 450) %>%
  ggplot(aes(x=meats, y=protein, fill=meats)) +
  geom_boxplot(alpha=0.3) +
  scale_y_continuous(breaks=seq(0,7000,25)) +
  coord_flip() +
  ggtitle("Distribution of proteins per recipe according to different kinds of meat") +
  xlab("Meats") +
  ylab("Protein Content in grams") +
  theme(legend.position="none") 

# Here we want to show which kinds of meat specifically have a high level of proteins

grid.arrange(barplot1, multi_boxplot1, ncol=2, nrow =1)

Among recipes with meat, the average protein content is clearly higher than in recipes without meat. Among those containing meat, goose, venison and lamb shank show the highest protein content.
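The visual difference in the barplot can be backed by a formal comparison; a sketch using a Welch two-sample t-test from base R on the same data:

#does mean protein differ between recipes with and without meat?
t.test(protein ~ meats_bin, data = recipes_full)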

4.3.3 Barplot and boxplot - Seafood and Sodium

# Seafood and sodium
barplot2 <- recipes_full %>% 
  ggplot(aes(x = factor(seafood_bin), y = sodium)) +
  stat_summary(fun = mean, geom = "bar") +
  ggtitle("Average amount of sodium per recipe with and without seafood")  +
  xlab("Presence of Seafood or not") +
  ylab("Sodium Content in milligrams")


# Boxplots per different kinds of seafood

recipes_seafood <- recipes_general %>% 
  select(ID, title, rating, calories, protein, fat, sodium, all_of(seafood_vec))

recipes_seafood <- recipes_seafood %>% 
  pivot_longer(cols=c("clam", "crab", "lobster", "mussel", "octopus", "oyster", "scallop", "shellfish", "shrimp", "squid" ),
                    names_to='seafoods',
                    values_to='yes_or_no') %>% 
  filter(yes_or_no == 1)

multi_boxplot2 <- recipes_seafood %>% 
  filter(sodium < 10000) %>%
  ggplot(aes(x=seafoods, y=sodium, fill=seafoods)) +
  geom_boxplot(alpha=0.3) +
  scale_y_continuous(breaks=seq(0,30000,500)) +
  coord_flip() +
  ggtitle("Distribution of sodium per recipe according to different kinds of seafood") +
  xlab("Seafood") +
  ylab("Sodium Content in milligrams") +
  theme(legend.position="none") 

# Here we want to show which kinds of seafood specifically have a high level of sodium

grid.arrange(barplot2, multi_boxplot2, ncol=2, nrow =1)

Here too the picture is clear: recipes with seafood have a higher average sodium content than recipes without seafood. Among the seafood recipes, clams and lobsters show the highest sodium content.

4.3.4 Barplot and boxplot - Carbs and Calories

# Carbs and calories
barplot3 <- recipes_full %>% 
  ggplot(aes(x = factor(carbs_bin), y = calories)) +
  stat_summary(fun = mean, geom = "bar") +
  ggtitle("Average amount of calories per recipe with and without carbohydrates") +
  xlab("Presence of carbohydrates or not") +
  ylab("Calories content")

# Afterwards we would also want to show which kinds of carbs specifically have a high number of calories


# Boxplots per different kinds of carbs
recipes_carbs <- recipes_general %>% 
  select(ID, title, rating, calories, protein, fat, sodium, all_of(carbs_vec))

recipes_carbs <- recipes_carbs %>% 
  pivot_longer(cols=c("brown_rice", "chickpea", "cornmeal", "couscous", "hominy_cornmeal_masa", "orzo", "pasta", "potato", "rice", "semolina", "sweet_potato_yam", "wild_rice"),
                    names_to='carbs',
                    values_to='yes_or_no') %>% 
  filter(yes_or_no == 1)

multi_boxplot3 <- recipes_carbs %>% 
  filter(calories < 10000) %>%
  ggplot(aes(x=carbs, y=calories, fill=carbs)) +
  geom_boxplot(alpha=0.3) +
  scale_y_continuous(breaks=seq(0,7000,500)) +
  coord_flip() +
  ggtitle("Distribution of calories per recipe according to different kinds of food high in carbohydrates") +
  xlab("Carbs") +
  ylab("Calories Content") +
  theme(legend.position="none") 

grid.arrange(barplot3, multi_boxplot3, ncol=2, nrow =1)

Recipes containing carbs have a higher average calorie content than recipes without carbs. Among them, pasta and chickpeas are the richest in calories.
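If we want the exact group means behind the barplot, a short summary suffices; a sketch, assuming recipes_full as built above:

#average calories for recipes with and without carbs, with group sizes
recipes_full %>%
  group_by(carbs_bin) %>%
  summarise(mean_calories = mean(calories, na.rm = TRUE), n = n())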

4.4 Exploratory PCA Analysis

  • With all 680 variables
recipes_pca <- recipes %>% 
  select(-ID, -title) %>% 
  PCA(ncp = 679, graph = FALSE)

recipes_pca
#> **Results for the Principal Component Analysis (PCA)**
#> The analysis was performed on 10163 individuals, described by 679 variables
#> *The results are available in the following objects:
#> 
#>    name               description                          
#> 1  "$eig"             "eigenvalues"                        
#> 2  "$var"             "results for the variables"          
#> 3  "$var$coord"       "coord. for the variables"           
#> 4  "$var$cor"         "correlations variables - dimensions"
#> 5  "$var$cos2"        "cos2 for the variables"             
#> 6  "$var$contrib"     "contributions of the variables"     
#> 7  "$ind"             "results for the individuals"        
#> 8  "$ind$coord"       "coord. for the individuals"         
#> 9  "$ind$cos2"        "cos2 for the individuals"           
#> 10 "$ind$contrib"     "contributions of the individuals"   
#> 11 "$call"            "summary statistics"                 
#> 12 "$call$centre"     "mean of the variables"              
#> 13 "$call$ecart.type" "standard error of the variables"    
#> 14 "$call$row.w"      "weights for the individuals"        
#> 15 "$call$col.w"      "weights for the variables"
fviz_pca_var(recipes_pca)

This PCA output is hard to interpret: the first two dimensions together explain only 2.1% of the variability in the data.
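To see how slowly the variance accumulates across dimensions, we can inspect the eigenvalue table that FactoMineR stores in the $eig component:

#variance and cumulative percentage per dimension, and the first
#dimension at which 80% of the total variance is reached
head(recipes_pca$eig)
which(recipes_pca$eig[, "cumulative percentage of variance"] >= 80)[1]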

fviz_contrib(recipes_pca, choice = "var", axes = 1)

fviz_contrib(recipes_pca, choice = "var", axes = 2)

Here too it is difficult to tell which variables contribute to each dimension, since each dimension accounts for only a small percentage of the variability.

fviz_pca_biplot(recipes_pca) ## biplot

fviz_eig(recipes_pca, addlabels = TRUE, ncp=11)

recipe_hc <- hclust(dist(recipes[,-c(1,2)], method = "manhattan"))

recipe_clust <- cutree(recipe_hc, k = 10)

fviz_pca_biplot(recipes_pca,
             col.ind = factor(recipe_clust))

4.4.1 Exploratory PCA Analysis

  • With only the variables we believe could be useful
nutritional_df <- recipes %>% 
  select(ID, all_of(nutritional_values))

#sanity check: recipes_analysis should have dimensions 10163 x 33
recipes_analysis <- ingredients_df_full %>% 
  left_join(nutritional_df, by = "ID") %>% 
  mutate(ID = as.character(ID)) %>% 
  select(rating, all_of(nutritional_values), contains("bin"), contains("total"))
recipes_pca2 <- PCA(recipes_analysis, ncp = 33, graph = FALSE)

recipes_pca2
#> **Results for the Principal Component Analysis (PCA)**
#> The analysis was performed on 10163 individuals, described by 33 variables
#> *The results are available in the following objects:
#> 
#>    name               description                          
#> 1  "$eig"             "eigenvalues"                        
#> 2  "$var"             "results for the variables"          
#> 3  "$var$coord"       "coord. for the variables"           
#> 4  "$var$cor"         "correlations variables - dimensions"
#> 5  "$var$cos2"        "cos2 for the variables"             
#> 6  "$var$contrib"     "contributions of the variables"     
#> 7  "$ind"             "results for the individuals"        
#> 8  "$ind$coord"       "coord. for the individuals"         
#> 9  "$ind$cos2"        "cos2 for the individuals"           
#> 10 "$ind$contrib"     "contributions of the individuals"   
#> 11 "$call"            "summary statistics"                 
#> 12 "$call$centre"     "mean of the variables"              
#> 13 "$call$ecart.type" "standard error of the variables"    
#> 14 "$call$row.w"      "weights for the individuals"        
#> 15 "$call$col.w"      "weights for the variables"
fviz_pca_var(recipes_pca2)

fviz_contrib(recipes_pca2, choice = "var", axes = 1)

fviz_contrib(recipes_pca2, choice = "var", axes = 2)

fviz_pca_biplot(recipes_pca2) ## biplot

fviz_eig(recipes_pca2, addlabels = TRUE, ncp=11)

p1 <- fviz_pca_biplot(recipes_pca2, axes = 1:2) 
p2 <- fviz_pca_biplot(recipes_pca2, axes = 3:4) 
p3 <- fviz_pca_biplot(recipes_pca2, axes = 5:6) 
p4 <- fviz_pca_biplot(recipes_pca2, axes = 7:8) 
p5 <- fviz_pca_biplot(recipes_pca2, axes = 9:10) 

grid.arrange(p1, p2, p3, p4, p5, nrow = 3, ncol=2)

recipe_hc2 <- hclust(dist(recipes_analysis, method = "manhattan"))

recipe_clust2 <- cutree(recipe_hc2, k = 10)

fviz_pca_biplot(recipes_pca2,
             col.ind = factor(recipe_clust2))
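Before interpreting the coloured biplot, it may help to check the cluster sizes and their average ratings; a minimal sketch, assuming the rows of recipes_analysis align with recipe_clust2:

#size of each of the 10 hierarchical clusters, and mean rating per cluster
table(recipe_clust2)
tapply(recipes_analysis$rating, recipe_clust2, mean)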

4.5 Seasons, Recipe Type, and Countries EDA

4.5.1 Seasons

#Create seasons df
seasons_df <- recipes %>% 
  select(ID, rating, all_of(seasons_vec)) %>% 
  filter(if_any(all_of(seasons_vec), ~ .x == 1)) %>% 
  mutate(sum_season = rowSums(across(all_of(seasons_vec))))

seasons_df %>% 
ggplot(aes(x=sum_season)) +
  geom_bar()

seasons_df %>% 
  filter(sum_season==3)
#>       ID rating winter spring summer fall sum_season
#> 1   1271  4.375      0      1      1    1          3
#> 2   1430  4.375      1      1      0    1          3
#> 3   2672  4.375      0      1      1    1          3
#> 4   2917  3.750      1      1      0    1          3
#> 5   4180  4.375      1      1      0    1          3
#> 6   7740  4.375      1      1      0    1          3
#> 7   8045  4.375      0      1      1    1          3
#> 8  10782  4.375      0      1      1    1          3
#> 9  12421  4.375      1      1      0    1          3
#> 10 12545  5.000      1      1      1    0          3
#> 11 15494  4.375      1      0      1    1          3
#> 12 16603  3.750      1      0      1    1          3
#> 13 16729  4.375      0      1      1    1          3
#> 14 17102  3.750      1      1      0    1          3
#> 15 18603  4.375      1      1      0    1          3
#> 16 18854  5.000      1      1      0    1          3
#> 17 18908  5.000      1      1      1    0          3
#> 18 19913  3.750      1      0      1    1          3
seasons_df %>% 
  filter(sum_season==4)
#>       ID rating winter spring summer fall sum_season
#> 1     54  5.000      1      1      1    1          4
#> 2   1837  3.750      1      1      1    1          4
#> 3   4733  3.125      1      1      1    1          4
#> 4   4980  3.750      1      1      1    1          4
#> 5   7341  1.250      1      1      1    1          4
#> 6   8593  5.000      1      1      1    1          4
#> 7  13759  4.375      1      1      1    1          4
#> 8  14419  3.750      1      1      1    1          4
#> 9  16448  2.500      1      1      1    1          4
#> 10 19011  3.750      1      1      1    1          4
#a total of 28 recipes are tagged with either 3 or 4 seasons --> let's discard them

#let's look more closely at those with 2 seasons to see whether the two seasons are adjacent or not

seasons_df %>% 
  filter(sum_season==2)
#>        ID rating winter spring summer fall sum_season
#> 1      25  3.750      1      0      0    1          2
#> 2      27  3.750      0      1      1    0          2
#> 3      57  4.375      1      0      0    1          2
#> 4      67  3.125      1      0      0    1          2
#> 5     130  4.375      1      0      0    1          2
#> 6     135  3.750      1      0      0    1          2
#> 7     153  4.375      0      0      1    1          2
#> 8     156  4.375      1      0      0    1          2
#> 9     167  4.375      1      0      0    1          2
#> 10    179  4.375      1      0      0    1          2
#> 11    238  3.750      0      0      1    1          2
#> 12    256  3.750      1      0      0    1          2
#> 13    275  3.750      0      1      1    0          2
#> 14    278  4.375      0      1      1    0          2
#> 15    368  4.375      1      0      0    1          2
#> 16    369  4.375      0      1      1    0          2
#> 17    401  3.750      1      0      0    1          2
#> 18    449  3.750      0      0      1    1          2
#> 19    501  4.375      0      1      1    0          2
#> 20    516  3.125      1      0      0    1          2
#> 21    538  5.000      1      0      0    1          2
#> 22    543  3.125      0      1      1    0          2
#> 23    558  4.375      1      0      0    1          2
#> 24    597  4.375      1      0      0    1          2
#> 25    599  4.375      1      0      0    1          2
#> 26    614  4.375      1      1      0    0          2
#> 27    618  4.375      1      0      0    1          2
#> 28    639  3.750      1      0      0    1          2
#> 29    664  3.750      1      0      0    1          2
#> 30    666  5.000      1      0      0    1          2
#> 31    682  4.375      1      1      0    0          2
#> 32    697  5.000      0      1      1    0          2
#> 33    700  4.375      0      1      1    0          2
#> 34    738  3.750      1      0      0    1          2
#> 35    802  3.750      0      0      1    1          2
#> 36    830  3.750      1      0      0    1          2
#> 37    864  5.000      1      0      0    1          2
#> 38    866  4.375      1      0      0    1          2
#> 39    890  3.125      1      0      0    1          2
#> 40    900  4.375      1      0      0    1          2
#> 41    902  4.375      1      0      0    1          2
#> 42    903  3.750      1      0      0    1          2
#> 43    917  4.375      1      0      0    1          2
#> 44    920  2.500      0      1      1    0          2
#> 45    956  4.375      1      0      0    1          2
#> 46    964  4.375      0      1      1    0          2
#> 47    986  4.375      1      0      0    1          2
#> 48   1003  4.375      1      1      0    0          2
#> 49   1060  4.375      1      0      0    1          2
#> 50   1106  4.375      0      1      1    0          2
#> 51   1134  4.375      1      0      0    1          2
#> 52   1179  3.125      1      0      0    1          2
#> 53   1207  4.375      0      1      1    0          2
#> 54   1212  4.375      0      1      1    0          2
#> 55   1268  4.375      0      1      1    0          2
#> 56   1275  3.750      0      1      1    0          2
#> 57   1290  4.375      1      0      0    1          2
#> 58   1308  3.125      0      0      1    1          2
#> 59   1345  4.375      0      0      1    1          2
#> 60   1346  3.125      1      0      0    1          2
#> 61   1405  5.000      1      0      0    1          2
#> 62   1406  3.750      1      0      0    1          2
#> 63   1411  4.375      1      0      0    1          2
#> 64   1459  3.750      1      0      0    1          2
#> 65   1507  4.375      0      1      1    0          2
#> 66   1528  4.375      0      1      1    0          2
#> 67   1581  5.000      1      0      0    1          2
#> 68   1602  5.000      1      1      0    0          2
#> 69   1648  3.750      0      1      1    0          2
#> 70   1688  3.125      0      1      0    1          2
#> 71   1732  3.750      0      0      1    1          2
#> 72   1744  3.750      1      0      0    1          2
#> 73   1757  3.750      1      0      0    1          2
#> 74   1794  5.000      1      1      0    0          2
#> 75   1797  3.125      1      0      0    1          2
#> 76   1805  4.375      0      1      1    0          2
#> 77   1806  4.375      1      1      0    0          2
#> 78   1832  3.750      1      0      0    1          2
#> 79   1846  5.000      1      0      0    1          2
#> 80   1895  3.750      1      0      0    1          2
#> 81   1932  4.375      1      0      0    1          2
#> 82   1940  4.375      1      0      0    1          2
#> 83   2020  4.375      0      1      1    0          2
#> 84   2035  5.000      0      1      1    0          2
#> 85   2069  5.000      1      0      0    1          2
#> 86   2088  4.375      0      1      1    0          2
#> 87   2093  4.375      1      0      0    1          2
#> 88   2117  4.375      1      0      0    1          2
#> 89   2137  4.375      1      0      0    1          2
#> 90   2207  4.375      0      0      1    1          2
#> 91   2241  3.750      1      0      0    1          2
#> 92   2243  5.000      1      0      0    1          2
#> 93   2245  4.375      1      0      0    1          2
#> 94   2261  5.000      0      1      1    0          2
#> 95   2269  4.375      1      0      0    1          2
#> 96   2328  4.375      1      0      0    1          2
#> 97   2331  4.375      1      0      0    1          2
#> 98   2380  3.750      1      0      0    1          2
#> 99   2417  3.750      0      1      1    0          2
#> 100  2518  3.750      1      0      0    1          2
#> 101  2610  3.750      0      1      1    0          2
#> 102  2620  4.375      1      0      0    1          2
#> 103  2694  3.125      1      0      0    1          2
#> 104  2717  4.375      1      0      0    1          2
#> 105  2727  5.000      0      0      1    1          2
#> 106  2733  3.750      1      0      0    1          2
#> 107  2746  5.000      0      1      1    0          2
#> 108  2756  3.125      1      0      0    1          2
#> 109  2769  4.375      1      0      0    1          2
#> 110  2778  4.375      0      1      1    0          2
#> 111  2825  5.000      1      0      0    1          2
#> 112  2835  3.750      1      0      0    1          2
#> 113  2838  3.750      0      1      1    0          2
#> 114  2842  3.750      1      0      0    1          2
#> 115  2898  3.750      1      0      0    1          2
#> 116  2938  4.375      0      1      1    0          2
#> 117  2952  4.375      0      1      1    0          2
#> 118  2963  5.000      1      0      0    1          2
#> 119  2965  5.000      1      1      0    0          2
#> 120  2966  3.750      1      0      0    1          2
#> 121  3007  3.125      1      0      0    1          2
#> 122  3010  4.375      0      1      0    1          2
#> 123  3049  4.375      1      0      0    1          2
#> 124  3076  4.375      1      0      0    1          2
#> 125  3149  4.375      0      1      1    0          2
#> 126  3172  3.750      1      0      0    1          2
#> 127  3212  4.375      1      0      0    1          2
#> 128  3219  3.750      0      1      1    0          2
#> 129  3233  3.750      1      0      0    1          2
#> 130  3238  5.000      0      1      1    0          2
#> 131  3257  5.000      1      0      0    1          2
#> 132  3310  4.375      1      0      0    1          2
#> 133  3369  4.375      1      0      0    1          2
#> 134  3402  3.750      1      0      0    1          2
#> 135  3406  4.375      1      0      0    1          2
#> 136  3417  4.375      0      1      1    0          2
#> 137  3468  3.125      0      1      1    0          2
#> 138  3548  4.375      1      0      0    1          2
#> 139  3562  4.375      1      0      0    1          2
#> 140  3569  3.750      1      0      0    1          2
#> 141  3600  5.000      1      0      0    1          2
#> 142  3618  5.000      0      1      1    0          2
#> 143  3663  4.375      1      0      0    1          2
#> 144  3698  4.375      1      0      0    1          2
#> 145  3734  4.375      1      0      1    0          2
#> 146  3773  3.750      1      0      0    1          2
#> 147  3881  4.375      0      1      1    0          2
#> 148  3926  4.375      1      0      0    1          2
#> 149  3945  4.375      1      0      0    1          2
#> 150  3985  4.375      1      0      0    1          2
#> 151  4016  3.750      0      1      1    0          2
#> 152  4025  4.375      1      0      0    1          2
#> 153  4040  4.375      0      1      1    0          2
#> 154  4056  4.375      0      1      1    0          2
#> 155  4064  4.375      1      0      0    1          2
#> 156  4074  4.375      0      1      0    1          2
#> 157  4195  4.375      0      1      1    0          2
#> 158  4204  4.375      1      0      0    1          2
#> 159  4241  5.000      0      1      0    1          2
#> 160  4377  4.375      1      0      0    1          2
#> 161  4414  3.125      1      0      0    1          2
#> 162  4473  4.375      1      0      0    1          2
#> 163  4510  4.375      0      0      1    1          2
#> 164  4539  4.375      1      0      0    1          2
#> 165  4552  4.375      1      0      0    1          2
#> 166  4637  4.375      1      0      0    1          2
#> 167  4682  4.375      1      0      0    1          2
#> 168  4737  4.375      1      0      0    1          2
#> 169  4788  4.375      1      0      0    1          2
#> 170  4839  4.375      1      0      0    1          2
#> 171  4887  3.750      1      0      0    1          2
#> 172  4946  4.375      1      0      0    1          2
#> 173  4966  4.375      0      1      1    0          2
#> 174  5149  5.000      0      1      1    0          2
#> 175  5151  4.375      1      0      0    1          2
#> 176  5216  3.750      1      0      0    1          2
#> 177  5235  4.375      0      1      1    0          2
#> 178  5373  4.375      1      0      0    1          2
#> 179  5407  1.875      1      0      0    1          2
#> 180  5410  5.000      1      0      1    0          2
#> 181  5450  3.750      1      0      0    1          2
#> 182  5451  4.375      0      1      1    0          2
#> 183  5485  4.375      0      1      1    0          2
#> 184  5516  3.750      1      1      0    0          2
#> 185  5537  2.500      0      1      1    0          2
#> 186  5597  4.375      1      0      0    1          2
#> 187  5613  3.125      0      0      1    1          2
#> 188  5622  3.750      1      0      0    1          2
#> 189  5655  4.375      1      0      0    1          2
#> 190  5660  4.375      1      1      0    0          2
#> 191  5689  4.375      1      0      0    1          2
#> 192  5757  4.375      0      0      1    1          2
#> 193  5775  1.875      1      0      0    1          2
#> 194  5778  3.750      1      0      0    1          2
#> 195  5820  3.750      1      0      1    0          2
#> 196  5880  5.000      0      1      1    0          2
#> 197  5905  1.875      1      0      0    1          2
#> 198  5975  4.375      1      0      0    1          2
#> 199  5996  4.375      1      0      0    1          2
#> 200  6006  4.375      1      0      0    1          2
#> 201  6053  4.375      0      1      0    1          2
#> 202  6124  4.375      1      0      1    0          2
#> 203  6148  4.375      1      0      0    1          2
#> 204  6150  4.375      1      0      0    1          2
#> 205  6155  3.750      0      1      1    0          2
#> 206  6175  3.750      1      0      0    1          2
#> 207  6235  3.125      1      0      0    1          2
#> 208  6238  4.375      1      0      0    1          2
#> 209  6241  5.000      0      1      1    0          2
#> 210  6260  3.750      0      1      0    1          2
#> 211  6310  4.375      1      0      0    1          2
#> 212  6315  4.375      1      0      0    1          2
#> 213  6358  4.375      0      0      1    1          2
#> 214  6373  3.125      0      1      1    0          2
#> 215  6422  5.000      0      0      1    1          2
#> 216  6426  4.375      1      0      0    1          2
#> 217  6432  5.000      0      1      1    0          2
#> 218  6467  4.375      1      0      0    1          2
#> 219  6479  4.375      0      1      1    0          2
#> 220  6515  4.375      1      0      0    1          2
#> 221  6564  3.750      0      1      1    0          2
#> 222  6569  4.375      1      0      0    1          2
#> 223  6573  4.375      1      0      0    1          2
#> 224  6614  3.750      0      0      1    1          2
#> 225  6648  5.000      1      0      0    1          2
#> 226  6650  5.000      1      0      0    1          2
#> 227  6704  4.375      1      0      0    1          2
#> 228  6709  3.750      1      0      0    1          2
#> 229  6780  4.375      0      1      1    0          2
#> 230  6825  4.375      1      0      0    1          2
#> 231  6847  4.375      0      1      1    0          2
#> 232  7121  3.750      1      0      0    1          2
#> 233  7140  4.375      1      0      0    1          2
#> 234  7144  3.750      0      1      1    0          2
#> 235  7282  4.375      0      1      1    0          2
#> 236  7285  3.750      1      0      0    1          2
#> 237  7310  3.750      1      0      0    1          2
#> 238  7326  4.375      1      0      0    1          2
#> 239  7333  3.125      0      1      0    1          2
#> 240  7397  5.000      1      0      0    1          2
#> 241  7480  5.000      0      1      1    0          2
#> 242  7482  4.375      0      1      1    0          2
#> 243  7532  5.000      0      1      1    0          2
#> 244  7557  4.375      1      0      0    1          2
#> 245  7565  3.750      1      0      0    1          2
#> 246  7586  4.375      1      0      0    1          2
#> 247  7754  3.125      1      0      0    1          2
#> 248  7782  4.375      1      0      0    1          2
#> 249  7803  4.375      1      0      0    1          2
#> 250  7815  4.375      0      0      1    1          2
#> 251  7829  4.375      0      1      1    0          2
#> 252  7871  4.375      1      0      0    1          2
#> 253  7898  5.000      0      1      1    0          2
#> 254  7997  4.375      0      0      1    1          2
#> 255  8019  3.750      1      0      0    1          2
#> 256  8021  3.750      1      0      0    1          2
#> 257  8105  4.375      0      1      1    0          2
#> 258  8106  5.000      1      0      0    1          2
#> 259  8107  4.375      0      1      1    0          2
#> 260  8218  4.375      1      0      0    1          2
#> ... [output truncated: 606 rows in total, one per recipe tagged with exactly two seasons (row index, ID, rating, four season indicator columns, sum_season = 2)]
An open question is whether these recipes, tagged with two seasons at once, should be kept.

As the correlation plot below shows, there is again no correlation between seasons and recipe ratings.

seasons_df %>% 
  select(-sum_season, -ID) %>% 
  plot_correlation()

We also engineer binary rating features (rating above 4, rating equal to 5, and rating equal to 1.25) to check whether the seasons correlate at least with extreme ratings.

The result is once again inconclusive.

seasons_df <- seasons_df %>% 
  mutate(rating_above_4 = ifelse(rating > 4, 1, 0),
         rating_5 = ifelse(rating == 5, 1, 0),
         rating_1.25 = ifelse(rating == 1.25, 1, 0))

seasons_df %>% 
  select(rating, rating_above_4, all_of(seasons_vec))%>% 
  plot_correlation()

seasons_df %>%
  select(rating_5, rating_1.25, all_of(seasons_vec)) %>%
  plot_correlation()

4.5.2 Recipe Type

#taking into account only the "main" recipe types
recipe_types_select <- c("breakfast", "brunch", "dessert", "dinner", "lunch")

type_df <- recipes %>% 
  select(ID, rating, all_of(recipe_types_select)) %>% 
  filter(if_any(all_of(recipe_types_select), ~ .x == 1)) %>% 
  mutate(sum_type = rowSums(across(all_of(recipe_types_select))))

type_df %>% 
  ggplot(aes(x=sum_type)) +
  geom_bar()

#keep only recipes tagged with a single type; 3333 obs remain
type_df <- type_df %>% 
  filter(sum_type <= 1)
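As a side illustration, the tag-count logic can be checked on toy data (hypothetical values, independent of the recipes data):

```r
# toy sketch: keep rows tagged with exactly one recipe type
toy <- data.frame(breakfast = c(1, 1, 0, 0),
                  dessert   = c(0, 1, 1, 0))
toy$sum_type <- rowSums(toy)   # 1, 2, 1, 0

# rows 1 and 3 carry exactly one tag; row 2 has two, row 4 none
toy[toy$sum_type == 1, ]
```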

Once again the correlations for recipe types are very low, so we will not include them in the analysis.

type_df %>% 
  select(-c(ID, sum_type)) %>% 
  plot_correlation()

4.5.3 Countries

Only 54 recipes contain information about the country, so we cannot use this variable in the analysis either.

countries_df <- recipes %>% 
  select(ID, rating, all_of(countries)) %>% 
  filter(if_any(all_of(countries), ~ .x == 1)) %>% 
  mutate(sum_type = rowSums(across(all_of(countries))))

countries_df %>% 
  ggplot(aes(x=sum_type)) +
  geom_bar()

5 Supervised Learning

5.1 0. Data Normalisation, splitting and balancing

We create a new data frame for the analysis, named recipes_analysis, which combines all nutrition-related features with all ingredient-related features.

nutritional_df <- recipes %>% 
  select(ID, all_of(nutritional_values))
  
# sanity check: recipes_analysis should have dimensions 10163 x 33
recipes_analysis <- ingredients_df_full %>% 
  left_join(nutritional_df, by="ID") %>% 
  mutate(across(contains("bin"), as.factor), ID = as.character(ID)) %>% 
  select(rating, all_of(nutritional_values), contains("bin"), contains("total"))

We create a normalised version of recipes_analysis in which the continuous numerical variables are rescaled to [0, 1], so that the scaling is consistent across all models in our analysis.

my_normalise <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
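A quick check of my_normalise on toy vectors (a self-contained sketch; note the edge case of a constant column, which yields NaN and should be removed beforehand):

```r
# min-max scaling maps any non-constant numeric vector onto [0, 1]
my_normalise <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

my_normalise(c(10, 15, 20))  # 0.0 0.5 1.0

# edge case: a constant column gives 0 / 0 = NaN
my_normalise(c(3, 3, 3))     # NaN NaN NaN
```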

recipes_analysis_normd <- recipes_analysis %>% 
  mutate(rating = as.factor(rating), across(where(is.numeric), my_normalise))

We create training and test sets using a 75/25 split.

#creating training and test sets
set.seed(12)
index <- createDataPartition(recipes_analysis_normd$rating, p=0.75, list = FALSE)

We first create training and test sets with the original 7-level rating variable.

As shown below, the multilevel training set is heavily unbalanced across the classes.

#creating training and test set with multilevel rating variable
train_multi <- recipes_analysis_normd[index, ]
test_multi <- recipes_analysis_normd[-index, ]

table(train_multi$rating)
#> 
#>  1.25 1.875   2.5 3.125  3.75 4.375     5 
#>    64    36   204   624  2187  3480  1029
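This imbalance directly determines the no-information rate, i.e. the accuracy of always predicting the majority class; a quick check on the training counts above:

```r
# class counts of the 7-level rating variable in the training set
counts <- c(`1.25` = 64, `1.875` = 36, `2.5` = 204, `3.125` = 624,
            `3.75` = 2187, `4.375` = 3480, `5` = 1029)

# no-information rate: share of the majority class (4.375)
nir <- max(counts) / sum(counts)
round(nir, 3)  # 0.456
```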

5.1.1 Creating train and test sets for some variable combination

#for train multi
tr_multi_all <- train_multi
te_multi_all <- test_multi

tr_multi_Nut <- train_multi %>% 
  select(rating, all_of(nutritional_values))
te_multi_Nut <- test_multi %>% 
  select(rating, all_of(nutritional_values))

5.1.2 TrainControl Functions

Here we create the train control functions we will use throughout the project.

#creating train control data for unbalanced data
trCtrl <- trainControl(method = "cv",
                       number = 5)
#train control with downsampling
trCtrl_down <- trainControl(method = "cv",
                       number = 5,
                       sampling = "down")
#train control with downsampling twoclass summary
trCtrl_down_twoClass <- trainControl(method = "cv",
                       summaryFunction = twoClassSummary,
                       classProbs = TRUE,
                       number = 5,
                       sampling = "down"
                       )

5.2 1.0 Testing performance on 7-level rating classification with RF

Given our correlation results in the EDA, we anticipated that a 7-level classification would be roughly equivalent to randomly assigning each observation to one of the 7 rating levels. To verify this, we run a random forest on the nutritional values together with both the “total” and “bin” ingredient-related features created during the EDA. We use a random forest for this testing phase because, as an ensemble method, it should achieve reasonable results if the data allow them; if its results are poor, we can safely assume that other models will not do much better.

5.2.1 Fitting RF - multi_all

Hyperparameters

Best tune: mtry = 2

Accuracy metrics

  • Accuracy: 45.6%
  • Kappa: 0.002
set.seed(12)
rf_multi_all_fit <- caret::train(rating ~ .,
                     data = tr_multi_all,
                     method = "rf",
                     preProcess = NULL,
                     trControl = trCtrl,
                     tuneLength = 3)
#show the random forest with cv 
rf_multi_all_fit
#> Random Forest 
#> 
#> 7624 samples
#>   32 predictor
#>    7 classes: '1.25', '1.875', '2.5', '3.125', '3.75', '4.375', '5' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold) 
#> Summary of sample sizes: 6100, 6100, 6098, 6099, 6099 
#> Resampling results across tuning parameters:
#> 
#>   mtry  Accuracy  Kappa   
#>    2    0.4568    0.003999
#>   17    0.4343    0.050215
#>   32    0.4332    0.050089
#> 
#> Accuracy was used to select the optimal model using the
#>  largest value.
#> The final value used for the model was mtry = 2.
plot(rf_multi_all_fit, main = "Plot of CV Accuracy relative to Mtry parameter")

5.2.2 Predictions RF - multi_all

As the predictions below show, our assumption was valid: the model does a very poor job of classifying the test set into the correct rating categories. It behaves like a random model with a very strong bias towards the majority class (4.375), due to the imbalance in the rating variable.

#make predictions
rf_multi_all_pred <- predict(rf_multi_all_fit, newdata = test_multi)

#confusion matrix 
confusionMatrix(data = rf_multi_all_pred, reference = test_multi$rating)
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction 1.25 1.875  2.5 3.125 3.75 4.375    5
#>      1.25     0     0    0     0    0     0    0
#>      1.875    0     0    0     0    0     0    0
#>      2.5      0     0    0     0    0     0    0
#>      3.125    0     0    0     0    0     0    0
#>      3.75     2     1    1     4    5     7    5
#>      4.375   19    11   67   204  724  1151  337
#>      5        0     0    0     0    0     1    0
#> 
#> Overall Statistics
#>                                         
#>                Accuracy : 0.455         
#>                  95% CI : (0.436, 0.475)
#>     No Information Rate : 0.456         
#>     P-Value [Acc > NIR] : 0.555         
#>                                         
#>                   Kappa : 0.001         
#>                                         
#>  Mcnemar's Test P-Value : NA            
#> 
#> Statistics by Class:
#> 
#>                      Class: 1.25 Class: 1.875 Class: 2.5
#> Sensitivity              0.00000      0.00000     0.0000
#> Specificity              1.00000      1.00000     1.0000
#> Pos Pred Value               NaN          NaN        NaN
#> Neg Pred Value           0.99173      0.99527     0.9732
#> Prevalence               0.00827      0.00473     0.0268
#> Detection Rate           0.00000      0.00000     0.0000
#> Detection Prevalence     0.00000      0.00000     0.0000
#> Balanced Accuracy        0.50000      0.50000     0.5000
#>                      Class: 3.125 Class: 3.75 Class: 4.375 Class: 5
#> Sensitivity                0.0000     0.00686        0.993 0.000000
#> Specificity                1.0000     0.98895        0.013 0.999545
#> Pos Pred Value                NaN     0.20000        0.458 0.000000
#> Neg Pred Value             0.9181     0.71201        0.692 0.865248
#> Prevalence                 0.0819     0.28712        0.456 0.134699
#> Detection Rate             0.0000     0.00197        0.453 0.000000
#> Detection Prevalence       0.0000     0.00985        0.990 0.000394
#> Balanced Accuracy          0.5000     0.49790        0.503 0.499772

5.2.3 Fitting - RF multi_Nut

We now fit another random forest using only the nutritional values as explanatory variables, to see whether that changes anything.

Hyperparameters

Best tune: mtry = 2

Accuracy metrics

  • Accuracy: 41.6%
  • Kappa: 0.048
set.seed(12)
rf_multi_Nut_fit <- train(rating ~ .,
                     data = tr_multi_Nut,
                     method = "rf",
                     preProcess = NULL,
                     trControl = trCtrl,
                     tuneLength = 3)
#show the random forest with cv 
rf_multi_Nut_fit
#> Random Forest 
#> 
#> 7624 samples
#>    4 predictor
#>    7 classes: '1.25', '1.875', '2.5', '3.125', '3.75', '4.375', '5' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold) 
#> Summary of sample sizes: 6100, 6100, 6098, 6099, 6099 
#> Resampling results across tuning parameters:
#> 
#>   mtry  Accuracy  Kappa  
#>   2     0.4122    0.03741
#>   3     0.4073    0.03377
#>   4     0.4083    0.03771
#> 
#> Accuracy was used to select the optimal model using the
#>  largest value.
#> The final value used for the model was mtry = 2.
plot(rf_multi_Nut_fit, main = "Plot of CV Accuracy relative to Mtry parameter")

5.2.4 Predictions - RF multi_Nut

The picture barely changes with only nutritional predictors: the model remains heavily biased towards the majority class at 4.375, and its accuracy (0.417) even falls below the no-information rate.

#make predictions
rf_multi_Nut_pred <- predict(rf_multi_Nut_fit, newdata = te_multi_Nut)

#confusion matrix 
confusionMatrix(data = rf_multi_Nut_pred, reference = te_multi_Nut$rating)
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction 1.25 1.875 2.5 3.125 3.75 4.375   5
#>      1.25     0     0   0     0    0     0   0
#>      1.875    0     0   0     0    0     0   0
#>      2.5      0     0   0     0    2     5   0
#>      3.125    0     0   1     7   12    15   8
#>      3.75     4     6  13    53  184   264  72
#>      4.375   15     5  44   134  496   815 209
#>      5        2     1  10    14   35    60  53
#> 
#> Overall Statistics
#>                                         
#>                Accuracy : 0.417         
#>                  95% CI : (0.398, 0.437)
#>     No Information Rate : 0.456         
#>     P-Value [Acc > NIR] : 1             
#>                                         
#>                   Kappa : 0.049         
#>                                         
#>  Mcnemar's Test P-Value : NA            
#> 
#> Statistics by Class:
#> 
#>                      Class: 1.25 Class: 1.875 Class: 2.5
#> Sensitivity              0.00000      0.00000    0.00000
#> Specificity              1.00000      1.00000    0.99717
#> Pos Pred Value               NaN          NaN    0.00000
#> Neg Pred Value           0.99173      0.99527    0.97314
#> Prevalence               0.00827      0.00473    0.02678
#> Detection Rate           0.00000      0.00000    0.00000
#> Detection Prevalence     0.00000      0.00000    0.00276
#> Balanced Accuracy        0.50000      0.50000    0.49858
#>                      Class: 3.125 Class: 3.75 Class: 4.375 Class: 5
#> Sensitivity               0.03365      0.2524        0.703   0.1550
#> Specificity               0.98456      0.7724        0.346   0.9445
#> Pos Pred Value            0.16279      0.3087        0.474   0.3029
#> Neg Pred Value            0.91947      0.7195        0.581   0.8777
#> Prevalence                0.08192      0.2871        0.456   0.1347
#> Detection Rate            0.00276      0.0725        0.321   0.0209
#> Detection Prevalence      0.01694      0.2347        0.677   0.0689
#> Balanced Accuracy         0.50910      0.5124        0.524   0.5497

5.2.5 Evaluation and decision for rest of analysis

Since we see a huge bias towards the majority class in the unbalanced data (which we expected), we must choose between three options:

  • We could balance the 7 levels, either up or down. Given the very large disparity in class sizes (the smallest class has only a few dozen observations while the largest has several thousand), we believe that neither up- nor downsampling is ideal: upsampling would create a massive number of artificial observations for the minority classes through resampling with replacement, while downsampling would leave us with too few observations overall.
  • We could aggregate the minority classes into a single class (i.e., merging 1.25, 1.875, 2.5, and 3.125), leaving 4 final classes. A significant imbalance would remain, however, and given the near-random classification observed above, we do not believe aggregating those classes is the solution.
  • We could transform the 7 levels into a binary classification problem. This would mitigate the class imbalance and should also yield higher accuracy metrics by simplifying the classification task.

After careful reflection, we believe the third option is our best bet. We therefore transform the 7-level rating variable into a binary rating variable with the threshold at 4: all ratings below 4 are now considered “bad” and all ratings of 4 or above are considered “good”.
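As a rough sketch of the resulting balance, the 7-level training counts shown earlier can be aggregated at the threshold of 4 (the binary split created below re-draws its own partition, so its counts may differ by an observation or two):

```r
levels7 <- c(1.25, 1.875, 2.5, 3.125, 3.75, 4.375, 5)
counts  <- c(64, 36, 204, 624, 2187, 3480, 1029)

# ratings below 4 become "bad"; 4 and above become "good"
c(bad = sum(counts[levels7 < 4]), good = sum(counts[levels7 >= 4]))
# bad 3115, good 4509 -- far less skewed than the 7-level split
```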

5.3 1.1 33 Engineered variables - unbalanced

#function to create the summary tables from the confusion matrices
summary_table_fun <- function(x){
  acc  <- x$overall[["Accuracy"]]
  kap  <- x$overall[["Kappa"]]
  bacc <- x$byClass[["Balanced Accuracy"]]
  sens <- x$byClass[["Sensitivity"]]
  vec <- c(acc, bacc, sens, kap)
  return(vec)
}
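To sanity-check the extraction logic without fitting a model, a name-based variant of the helper can be run on a mocked list that mirrors caret's confusionMatrix components (the metric values below are invented):

```r
# name-based variant of the helper: accuracy, balanced accuracy,
# sensitivity, kappa, pulled from a confusionMatrix-like list
summary_table_fun <- function(x){
  c(x$overall[["Accuracy"]],
    x$byClass[["Balanced Accuracy"]],
    x$byClass[["Sensitivity"]],
    x$overall[["Kappa"]])
}

mock_cm <- list(overall = c(Accuracy = 0.70, Kappa = 0.40),
                byClass = c(Sensitivity = 0.80, `Balanced Accuracy` = 0.70))

summary_table_fun(mock_cm)  # 0.70 0.70 0.80 0.40
```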

#creating empty summary table for unbalanced data
summary_table_unbal <- tibble(metric = c("accuracy", "balanced_accuracy", "sensitivity", "kappa"))

5.3.1 Data preparation for analysis

5.3.1.1 Creating new rating_bin variable

We transform the original 7-level rating variable into a binary rating_bin variable: ratings below 4 become “bad” and ratings of 4 or above become “good”.

#creating binary rating variable
recipes_analysis_bin <- recipes_analysis %>% 
  mutate(rating_bin = as.factor(ifelse(rating<4, "bad", "good"))) %>% 
  select(-rating)

#reversing rating_bin factor order
recipes_analysis_bin$rating_bin <- factor(recipes_analysis_bin$rating_bin, levels = rev(levels(recipes_analysis_bin$rating_bin)))

#normalising newly created df
recipes_analysis_bin <- recipes_analysis_bin %>% 
  mutate(across(where(is.numeric), my_normalise))
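`my_normalise` is defined earlier in the report; for reference, a typical min-max implementation (an assumption — the actual definition may differ) looks like:

```r
#min-max normalisation to the [0, 1] range (assumed definition of my_normalise)
my_normalise <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

my_normalise(c(2, 4, 6, 10))
#> [1] 0.00 0.25 0.50 1.00
```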

5.3.1.2 Creating rating_bin training and test sets

#creating training and test sets
set.seed(12)
index_bin <- createDataPartition(recipes_analysis_bin$rating_bin, p=0.75, list = FALSE)

We create training and test sets with the new two-level rating_bin variable.

As the table below shows, the binary training set is still somewhat unbalanced across the two classes.

#creating training and test sets with the binary rating variable
train_bin <- recipes_analysis_bin[index_bin, ]
test_bin <- recipes_analysis_bin[-index_bin, ]

table(train_bin$rating_bin)
#> 
#> good  bad 
#> 4508 3115
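Using the counts printed above, the class shares confirm a roughly 59/41 split (on the live data frame, the equivalent call would be prop.table(table(train_bin$rating_bin))):

```r
#class shares in the binary training set, from the counts printed above
round(prop.table(c(good = 4508, bad = 3115)), 3)
```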

5.3.1.3 Creating various train/test sets with different variable combinations

#creating train and test set for all combinations of variables, for binary rating df
train_bin_Nut <- train_bin %>% 
  select(rating_bin, all_of(nutritional_values))
test_bin_Nut <- test_bin %>% 
  select(rating_bin, all_of(nutritional_values))

train_bin_NuTo <- train_bin %>% 
  select(rating_bin, all_of(nutritional_values), all_of(contains("total")))
test_bin_NuTo <- test_bin %>% 
  select(rating_bin, all_of(nutritional_values), all_of(contains("total")))

train_bin_NuBi <- train_bin %>% 
  select(rating_bin, all_of(nutritional_values), all_of(contains("bin")))
test_bin_NuBi <- test_bin %>% 
  select(rating_bin, all_of(nutritional_values), all_of(contains("bin")))

train_bin_Tot <- train_bin %>% 
  select(rating_bin, all_of(contains("total")))
test_bin_Tot <- test_bin %>% 
  select(rating_bin, all_of(contains("total")))

train_bin_Bin <- train_bin %>% 
  select(rating_bin, all_of(contains("bin")))
test_bin_Bin <- test_bin %>% 
  select(rating_bin, all_of(contains("bin")))

5.3.2 Logistic Regression - unbalanced - Nut

5.3.2.1 Fitting

set.seed(12)
logr_Nut_unbal_fit <- caret::train(rating_bin ~.,
                data = train_bin_Nut,
                method = "glm",
                family = "binomial",
                trControl = trCtrl)
logr_Nut_unbal_fit
#> Generalized Linear Model 
#> 
#> 7623 samples
#>    4 predictor
#>    2 classes: 'good', 'bad' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold) 
#> Summary of sample sizes: 6099, 6099, 6098, 6098, 6098 
#> Resampling results:
#> 
#>   Accuracy  Kappa
#>   0.5914    0

5.3.2.2 Predictions

We see that the relatively high accuracy is achieved by classifying every observation as good, resulting in perfect sensitivity but zero specificity — the model never predicts bad.

logr_Nut_unbal_pred <- predict(logr_Nut_unbal_fit, newdata = test_bin_Nut, type = "raw")

CM <- confusionMatrix(reference = test_bin_Nut$rating_bin, data = logr_Nut_unbal_pred, positive="good")
CM
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction good  bad
#>       good 1502 1038
#>       bad     0    0
#>                                         
#>                Accuracy : 0.591         
#>                  95% CI : (0.572, 0.611)
#>     No Information Rate : 0.591         
#>     P-Value [Acc > NIR] : 0.509         
#>                                         
#>                   Kappa : 0             
#>                                         
#>  Mcnemar's Test P-Value : <2e-16        
#>                                         
#>             Sensitivity : 1.000         
#>             Specificity : 0.000         
#>          Pos Pred Value : 0.591         
#>          Neg Pred Value :   NaN         
#>              Prevalence : 0.591         
#>          Detection Rate : 0.591         
#>    Detection Prevalence : 1.000         
#>       Balanced Accuracy : 0.500         
#>                                         
#>        'Positive' Class : good          
#> 
summary_table_unbal <- cbind(summary_table_unbal, LogReg_Nut = summary_table_fun(CM))
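Since every observation is predicted good, the accuracy above is nothing more than the prevalence of the good class; a quick check from the confusion-matrix counts:

```r
#with all predictions "good", accuracy collapses to the prevalence of "good"
acc  <- 1502 / (1502 + 1038)   # correctly classified "good" / all test observations
bacc <- (1 + 0) / 2            # (sensitivity + specificity) / 2
round(c(accuracy = acc, balanced_accuracy = bacc), 3)
```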

5.3.3 Logistic Regression - unbalanced - NuBi

5.3.3.1 Fitting

set.seed(12)
logr_NuBi_unbal_fit <- train(rating_bin ~.,
                data = train_bin_NuBi,
                method = "glm",
                family = "binomial",
                trControl = trCtrl)
logr_NuBi_unbal_fit
#> Generalized Linear Model 
#> 
#> 7623 samples
#>   18 predictor
#>    2 classes: 'good', 'bad' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold) 
#> Summary of sample sizes: 6099, 6099, 6098, 6098, 6098 
#> Resampling results:
#> 
#>   Accuracy  Kappa  
#>   0.5971    0.04562

5.3.3.2 Predictions

Once again, most observations are classified as good.

logr_NuBi_unbal_pred <- predict(logr_NuBi_unbal_fit, newdata = test_bin_NuBi, type = "raw")

CM <- confusionMatrix(reference = test_bin_NuBi$rating_bin, data = logr_NuBi_unbal_pred, positive="good")
CM
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction good  bad
#>       good 1399  950
#>       bad   103   88
#>                                         
#>                Accuracy : 0.585         
#>                  95% CI : (0.566, 0.605)
#>     No Information Rate : 0.591         
#>     P-Value [Acc > NIR] : 0.734         
#>                                         
#>                   Kappa : 0.019         
#>                                         
#>  Mcnemar's Test P-Value : <2e-16        
#>                                         
#>             Sensitivity : 0.9314        
#>             Specificity : 0.0848        
#>          Pos Pred Value : 0.5956        
#>          Neg Pred Value : 0.4607        
#>              Prevalence : 0.5913        
#>          Detection Rate : 0.5508        
#>    Detection Prevalence : 0.9248        
#>       Balanced Accuracy : 0.5081        
#>                                         
#>        'Positive' Class : good          
#> 
summary_table_unbal <- cbind(summary_table_unbal, LogReg_NuBi = summary_table_fun(CM))

5.3.4 Logistic Regression - unbalanced - NuTo

5.3.4.1 Fitting

set.seed(12)
logr_NuTo_unbal_fit <- train(rating_bin ~.,
                data = train_bin_NuTo,
                method = "glm",
                family = "binomial",
                trControl = trCtrl)
logr_NuTo_unbal_fit
#> Generalized Linear Model 
#> 
#> 7623 samples
#>   18 predictor
#>    2 classes: 'good', 'bad' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold) 
#> Summary of sample sizes: 6099, 6099, 6098, 6098, 6098 
#> Resampling results:
#> 
#>   Accuracy  Kappa  
#>   0.595     0.03911

5.3.4.2 Predictions

Once again, most observations are classified as good.

logr_NuTo_unbal_pred <- predict(logr_NuTo_unbal_fit, newdata = test_bin_NuTo, type = "raw")

CM <- confusionMatrix(reference = test_bin_NuTo$rating_bin, data = logr_NuTo_unbal_pred, positive="good")
CM
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction good  bad
#>       good 1401  942
#>       bad   101   96
#>                                        
#>                Accuracy : 0.589        
#>                  95% CI : (0.57, 0.609)
#>     No Information Rate : 0.591        
#>     P-Value [Acc > NIR] : 0.588        
#>                                        
#>                   Kappa : 0.029        
#>                                        
#>  Mcnemar's Test P-Value : <2e-16       
#>                                        
#>             Sensitivity : 0.9328       
#>             Specificity : 0.0925       
#>          Pos Pred Value : 0.5980       
#>          Neg Pred Value : 0.4873       
#>              Prevalence : 0.5913       
#>          Detection Rate : 0.5516       
#>    Detection Prevalence : 0.9224       
#>       Balanced Accuracy : 0.5126       
#>                                        
#>        'Positive' Class : good         
#> 
summary_table_unbal <- cbind(summary_table_unbal, LogReg_NuTo = summary_table_fun(CM))

5.3.5 KNN - unbalanced - Nut

5.3.5.1 Fitting

An initial search over the range k = 1–151 pointed to k = 111 when weighing both accuracy and Kappa; on the refined grid below, caret's accuracy-based selection lands on k = 113.

set.seed(12)
knn_Nut_unbal_fit <- caret::train(rating_bin ~.,
                data = train_bin_Nut,
                method = "knn",
                trControl = trCtrl,
                tuneGrid= data.frame(k = seq(105, 115,by = 2))#initially checked on 1-151 range, best was 111
                )
knn_Nut_unbal_fit
#> k-Nearest Neighbors 
#> 
#> 7623 samples
#>    4 predictor
#>    2 classes: 'good', 'bad' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold) 
#> Summary of sample sizes: 6099, 6099, 6098, 6098, 6098 
#> Resampling results across tuning parameters:
#> 
#>   k    Accuracy  Kappa  
#>   105  0.5890    0.03661
#>   107  0.5876    0.03221
#>   109  0.5881    0.03268
#>   111  0.5885    0.03322
#>   113  0.5895    0.03570
#>   115  0.5881    0.03235
#> 
#> Accuracy was used to select the optimal model using the
#>  largest value.
#> The final value used for the model was k = 113.

5.3.5.2 Predictions

We see that the relatively high accuracy is achieved by classifying most observations as good, resulting in high sensitivity but very poor specificity.

knn_Nut_unbal_pred <- predict(knn_Nut_unbal_fit, newdata = test_bin_Nut, type = "raw")

CM <- confusionMatrix(reference = test_bin_Nut$rating_bin, data = knn_Nut_unbal_pred, positive="good")
CM
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction good  bad
#>       good 1382  903
#>       bad   120  135
#>                                         
#>                Accuracy : 0.597         
#>                  95% CI : (0.578, 0.616)
#>     No Information Rate : 0.591         
#>     P-Value [Acc > NIR] : 0.279         
#>                                         
#>                   Kappa : 0.057         
#>                                         
#>  Mcnemar's Test P-Value : <2e-16        
#>                                         
#>             Sensitivity : 0.920         
#>             Specificity : 0.130         
#>          Pos Pred Value : 0.605         
#>          Neg Pred Value : 0.529         
#>              Prevalence : 0.591         
#>          Detection Rate : 0.544         
#>    Detection Prevalence : 0.900         
#>       Balanced Accuracy : 0.525         
#>                                         
#>        'Positive' Class : good          
#> 
summary_table_unbal <- cbind(summary_table_unbal, KNN_Nut = summary_table_fun(CM))

5.3.6 KNN - unbalanced - NuBi

5.3.6.1 Fitting

An initial search over the range k = 1–151 pointed to k = 135 when weighing both accuracy and Kappa; on the refined grid below, caret's accuracy-based selection lands on k = 131.

set.seed(12)
knn_NuBi_unbal_fit <- caret::train(rating_bin ~.,
                data = train_bin_NuBi,
                method = "knn",
                trControl = trCtrl,
                tuneGrid= data.frame(k = seq(131, 141,by = 2))#initially checked on 1-151 range, best was 135
                )
knn_NuBi_unbal_fit
#> k-Nearest Neighbors 
#> 
#> 7623 samples
#>   18 predictor
#>    2 classes: 'good', 'bad' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold) 
#> Summary of sample sizes: 6099, 6099, 6098, 6098, 6098 
#> Resampling results across tuning parameters:
#> 
#>   k    Accuracy  Kappa  
#>   131  0.5954    0.04616
#>   133  0.5936    0.04216
#>   135  0.5935    0.04255
#>   137  0.5946    0.04432
#>   139  0.5948    0.04454
#>   141  0.5943    0.04332
#> 
#> Accuracy was used to select the optimal model using the
#>  largest value.
#> The final value used for the model was k = 131.

5.3.6.2 Predictions

Once again, most observations are classified as good, and accuracy is no better than the no-information rate (NIR).

knn_NuBi_unbal_pred <- predict(knn_NuBi_unbal_fit, newdata = test_bin_NuBi, type = "raw")

CM <- confusionMatrix(reference = test_bin_NuBi$rating_bin, data = knn_NuBi_unbal_pred, positive="good")
CM
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction good  bad
#>       good 1353  908
#>       bad   149  130
#>                                         
#>                Accuracy : 0.584         
#>                  95% CI : (0.564, 0.603)
#>     No Information Rate : 0.591         
#>     P-Value [Acc > NIR] : 0.785         
#>                                         
#>                   Kappa : 0.029         
#>                                         
#>  Mcnemar's Test P-Value : <2e-16        
#>                                         
#>             Sensitivity : 0.901         
#>             Specificity : 0.125         
#>          Pos Pred Value : 0.598         
#>          Neg Pred Value : 0.466         
#>              Prevalence : 0.591         
#>          Detection Rate : 0.533         
#>    Detection Prevalence : 0.890         
#>       Balanced Accuracy : 0.513         
#>                                         
#>        'Positive' Class : good          
#> 
summary_table_unbal <- cbind(summary_table_unbal, KNN_NuBi = summary_table_fun(CM))

5.3.7 KNN - unbalanced - NuTo

5.3.7.1 Fitting

An initial search over the range k = 1–151 pointed to k = 135 when weighing both accuracy and Kappa; on the refined grid below, caret's accuracy-based selection lands on k = 137.

set.seed(12)
knn_NuTo_unbal_fit <- caret::train(rating_bin ~.,
                data = train_bin_NuTo,
                method = "knn",
                trControl = trCtrl,
                tuneGrid= data.frame(k = seq(131, 141,by = 2))#initially checked on 1-151 range, best was 135
                )
knn_NuTo_unbal_fit
#> k-Nearest Neighbors 
#> 
#> 7623 samples
#>   18 predictor
#>    2 classes: 'good', 'bad' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold) 
#> Summary of sample sizes: 6099, 6099, 6098, 6098, 6098 
#> Resampling results across tuning parameters:
#> 
#>   k    Accuracy  Kappa  
#>   131  0.5907    0.03996
#>   133  0.5920    0.04324
#>   135  0.5925    0.04351
#>   137  0.5931    0.04508
#>   139  0.5929    0.04538
#>   141  0.5928    0.04556
#> 
#> Accuracy was used to select the optimal model using the
#>  largest value.
#> The final value used for the model was k = 137.

5.3.7.2 Predictions

Once again, most observations are classified as good, and accuracy is no better than the no-information rate (NIR).

knn_NuTo_unbal_pred <- predict(knn_NuTo_unbal_fit, newdata = test_bin_NuTo, type = "raw")

CM <- confusionMatrix(reference = test_bin_NuTo$rating_bin, data = knn_NuTo_unbal_pred, positive="good")
CM
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction good  bad
#>       good 1359  910
#>       bad   143  128
#>                                         
#>                Accuracy : 0.585         
#>                  95% CI : (0.566, 0.605)
#>     No Information Rate : 0.591         
#>     P-Value [Acc > NIR] : 0.734         
#>                                         
#>                   Kappa : 0.032         
#>                                         
#>  Mcnemar's Test P-Value : <2e-16        
#>                                         
#>             Sensitivity : 0.905         
#>             Specificity : 0.123         
#>          Pos Pred Value : 0.599         
#>          Neg Pred Value : 0.472         
#>              Prevalence : 0.591         
#>          Detection Rate : 0.535         
#>    Detection Prevalence : 0.893         
#>       Balanced Accuracy : 0.514         
#>                                         
#>        'Positive' Class : good          
#> 
summary_table_unbal <- cbind(summary_table_unbal, KNN_NuTo = summary_table_fun(CM))

5.3.8 CART - unbalanced - Nut

5.3.8.1 Fitting

#CART_grid <-  expand.grid(cp = c(0.001,0.005,0.01, 0.015, 0.02, 0.05, 0.1, 0.15, 0.2)) --> confirmed that highest kappa is at cp selected by caret
set.seed(12)
cart_Nut_unbal_fit <- train(rating_bin ~ .,
                  data = train_bin_Nut,
                  method = "rpart",
                  trControl = trCtrl,
                  tuneLength = 10)
cart_Nut_unbal_fit
#> CART 
#> 
#> 7623 samples
#>    4 predictor
#>    2 classes: 'good', 'bad' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold) 
#> Summary of sample sizes: 6099, 6099, 6098, 6098, 6098 
#> Resampling results across tuning parameters:
#> 
#>   cp        Accuracy  Kappa  
#>   0.001284  0.5838    0.05640
#>   0.001445  0.5863    0.05858
#>   0.001605  0.5891    0.05943
#>   0.001734  0.5883    0.05246
#>   0.001819  0.5880    0.05409
#>   0.001926  0.5880    0.05189
#>   0.002033  0.5893    0.05045
#>   0.002568  0.5872    0.02484
#>   0.002729  0.5911    0.02326
#>   0.003050  0.5932    0.01641
#> 
#> Accuracy was used to select the optimal model using the
#>  largest value.
#> The final value used for the model was cp = 0.00305.
# Get the results for each CP step
cart_Nut_unbal_results <- cart_Nut_unbal_fit$results

cart_Nut_unbal_results %>% 
  ggplot(aes(x=cp, y=Accuracy)) +
  geom_line()+
  labs(x = "CP Parameter", y = "Accuracy", title = "CP Plot")

5.3.8.2 Predictions

#Now using the cv to predict test set
cart_Nut_unbal_pred <- predict(cart_Nut_unbal_fit, newdata = test_bin_Nut)
#And looking at the confusion matrix for the predictions using the cv 
CM <- confusionMatrix(reference = test_bin_Nut$rating_bin, data = cart_Nut_unbal_pred)
CM
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction good  bad
#>       good 1502 1038
#>       bad     0    0
#>                                         
#>                Accuracy : 0.591         
#>                  95% CI : (0.572, 0.611)
#>     No Information Rate : 0.591         
#>     P-Value [Acc > NIR] : 0.509         
#>                                         
#>                   Kappa : 0             
#>                                         
#>  Mcnemar's Test P-Value : <2e-16        
#>                                         
#>             Sensitivity : 1.000         
#>             Specificity : 0.000         
#>          Pos Pred Value : 0.591         
#>          Neg Pred Value :   NaN         
#>              Prevalence : 0.591         
#>          Detection Rate : 0.591         
#>    Detection Prevalence : 1.000         
#>       Balanced Accuracy : 0.500         
#>                                         
#>        'Positive' Class : good          
#> 
summary_table_unbal <- cbind(summary_table_unbal, CART_Nut = summary_table_fun(CM))

5.3.9 CART - unbalanced - NuBi

5.3.9.1 Fitting

#CART_grid <-  expand.grid(cp = c(0.001,0.005,0.01, 0.015, 0.02, 0.05, 0.1, 0.15, 0.2)) --> confirmed that highest kappa is at cp selected by caret
set.seed(12)
cart_NuBi_unbal_fit <- train(rating_bin ~ .,
                  data = train_bin_NuBi,
                  method = "rpart",
                  trControl = trCtrl,
                  tuneLength = 10)
cart_NuBi_unbal_fit
#> CART 
#> 
#> 7623 samples
#>   18 predictor
#>    2 classes: 'good', 'bad' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold) 
#> Summary of sample sizes: 6099, 6099, 6098, 6098, 6098 
#> Resampling results across tuning parameters:
#> 
#>   cp        Accuracy  Kappa   
#>   0.001284  0.5839    0.049008
#>   0.001364  0.5840    0.048006
#>   0.001445  0.5838    0.047922
#>   0.001498  0.5830    0.045608
#>   0.001846  0.5852    0.049754
#>   0.002354  0.5859    0.043717
#>   0.002889  0.5877    0.046967
#>   0.003210  0.5877    0.043065
#>   0.005457  0.5915    0.020972
#>   0.007223  0.5894    0.009009
#> 
#> Accuracy was used to select the optimal model using the
#>  largest value.
#> The final value used for the model was cp = 0.005457.

It might seem that the cp chosen by caret is not optimal, since the accuracy curve keeps rising; however, a manual tuning grid confirmed that this cp is indeed the best when accuracy is weighed together with the Kappa metric.

# Get the results for each CP step
cart_NuBi_unbal_results <- cart_NuBi_unbal_fit$results

cart_NuBi_unbal_results %>% 
  ggplot(aes(x=cp, y=Accuracy)) +
  geom_line()+
  labs(x = "CP Parameter", y = "Accuracy", title = "CP Plot")

5.3.9.2 Predictions

#Now using the cv to predict test set
cart_NuBi_unbal_pred <- predict(cart_NuBi_unbal_fit, newdata = test_bin_NuBi)
#And looking at the confusion matrix for the predictions using the cv 
CM <- confusionMatrix(reference = test_bin_NuBi$rating_bin, data = cart_NuBi_unbal_pred)
CM
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction good  bad
#>       good 1377  919
#>       bad   125  119
#>                                        
#>                Accuracy : 0.589        
#>                  95% CI : (0.57, 0.608)
#>     No Information Rate : 0.591        
#>     P-Value [Acc > NIR] : 0.604        
#>                                        
#>                   Kappa : 0.036        
#>                                        
#>  Mcnemar's Test P-Value : <2e-16       
#>                                        
#>             Sensitivity : 0.917        
#>             Specificity : 0.115        
#>          Pos Pred Value : 0.600        
#>          Neg Pred Value : 0.488        
#>              Prevalence : 0.591        
#>          Detection Rate : 0.542        
#>    Detection Prevalence : 0.904        
#>       Balanced Accuracy : 0.516        
#>                                        
#>        'Positive' Class : good         
#> 
summary_table_unbal <- cbind(summary_table_unbal, CART_NuBi = summary_table_fun(CM))

5.3.10 CART - unbalanced - NuTo

5.3.10.1 Fitting

#CART_grid <-  expand.grid(cp = c(0.001,0.005,0.01, 0.015, 0.02, 0.05, 0.1, 0.15, 0.2)) --> confirmed that highest kappa is at cp selected by caret
set.seed(12)
cart_NuTo_unbal_fit <- train(rating_bin ~ .,
                  data = train_bin_NuTo,
                  method = "rpart",
                  trControl = trCtrl,
                  tuneLength = 10)
cart_NuTo_unbal_fit
#> CART 
#> 
#> 7623 samples
#>   18 predictor
#>    2 classes: 'good', 'bad' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold) 
#> Summary of sample sizes: 6099, 6099, 6098, 6098, 6098 
#> Resampling results across tuning parameters:
#> 
#>   cp        Accuracy  Kappa  
#>   0.001445  0.5878    0.06262
#>   0.001498  0.5861    0.05752
#>   0.001605  0.5881    0.05903
#>   0.001766  0.5885    0.06038
#>   0.001926  0.5870    0.05717
#>   0.002247  0.5876    0.04742
#>   0.002408  0.5886    0.04678
#>   0.002568  0.5886    0.04678
#>   0.003050  0.5881    0.05055
#>   0.004173  0.5911    0.02112
#> 
#> Accuracy was used to select the optimal model using the
#>  largest value.
#> The final value used for the model was cp = 0.004173.

It might seem that the cp chosen by caret is not optimal, since the accuracy curve keeps rising; however, a manual tuning grid confirmed that this cp is indeed the best when accuracy is weighed together with the Kappa metric.

# Get the results for each CP step
cart_NuTo_unbal_results <- cart_NuTo_unbal_fit$results

cart_NuTo_unbal_results %>% 
  ggplot(aes(x=cp, y=Accuracy)) +
  geom_line()+
  labs(x = "CP Parameter", y = "Accuracy", title = "CP Plot")

5.3.10.2 Predictions

#Now using the cv to predict test set
cart_NuTo_unbal_pred <- predict(cart_NuTo_unbal_fit, newdata = test_bin_NuTo)
#And looking at the confusion matrix for the predictions using the cv 
CM <- confusionMatrix(reference = test_bin_NuTo$rating_bin, data = cart_NuTo_unbal_pred)
CM
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction good  bad
#>       good 1502 1038
#>       bad     0    0
#>                                         
#>                Accuracy : 0.591         
#>                  95% CI : (0.572, 0.611)
#>     No Information Rate : 0.591         
#>     P-Value [Acc > NIR] : 0.509         
#>                                         
#>                   Kappa : 0             
#>                                         
#>  Mcnemar's Test P-Value : <2e-16        
#>                                         
#>             Sensitivity : 1.000         
#>             Specificity : 0.000         
#>          Pos Pred Value : 0.591         
#>          Neg Pred Value :   NaN         
#>              Prevalence : 0.591         
#>          Detection Rate : 0.591         
#>    Detection Prevalence : 1.000         
#>       Balanced Accuracy : 0.500         
#>                                         
#>        'Positive' Class : good          
#> 
summary_table_unbal <- cbind(summary_table_unbal, CART_NuTo = summary_table_fun(CM))

5.4 33 Engineered variables - balanced

#creating empty summary table for balanced data
summary_table_down <- tibble(metric = c("accuracy", "balanced_accuracy", "sensitivity", "kappa"))
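trCtrl_down is defined earlier in the report; assuming it relies on caret's built-in resampling-time down-sampling (which the "Addtional sampling using down-sampling" line in the outputs below suggests), its definition would look like:

```r
library(caret)

#5-fold CV with down-sampling applied inside each resampling fold
#(assumed definition of trCtrl_down; the actual one may differ)
trCtrl_down <- trainControl(method = "cv", number = 5, sampling = "down")
```

Applying the down-sampling within each fold, rather than once on the whole training set, avoids leaking the resampled class distribution into the held-out folds.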

5.4.1 Random Forest - balanced - Nut

Hyperparameters

Best tune: mtry = 2

Accuracy metrics

  • ACC 53.8
  • BACC 53.8
  • Kappa 7.40
  • Recall 53.9

5.4.1.1 Fitting

set.seed(12)
rf_Nut_fit <- caret::train(rating_bin ~ .,
                     data = train_bin_Nut,
                     method = "rf",
                     trControl = trCtrl_down,
                     tuneLength = 10)
#> note: only 3 unique complexity parameters in default grid. Truncating the grid to 3 .
#show the random forest with cv 
rf_Nut_fit
#> Random Forest 
#> 
#> 7623 samples
#>    4 predictor
#>    2 classes: 'good', 'bad' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold) 
#> Summary of sample sizes: 6099, 6099, 6098, 6098, 6098 
#> Addtional sampling using down-sampling
#> 
#> Resampling results across tuning parameters:
#> 
#>   mtry  Accuracy  Kappa  
#>   2     0.5310    0.06036
#>   3     0.5247    0.05028
#>   4     0.5229    0.04546
#> 
#> Accuracy was used to select the optimal model using the
#>  largest value.
#> The final value used for the model was mtry = 2.
plot(rf_Nut_fit, main = "Plot of CV Accuracy relative to Mtry parameter")

5.4.1.2 Predictions

As can be seen from the predictions below, down-sampling removes the strong bias towards the majority class that we observed with the unbalanced data, but the model still does a poor job of classifying the test set: with a balanced accuracy of roughly 0.54, it performs barely better than a random classifier.

set.seed(12)
#make predictions
rf_Nut_pred <- predict(rf_Nut_fit, newdata = test_bin_Nut)

#confusion matrix 
CM <- confusionMatrix(data = rf_Nut_pred, reference = test_bin_Nut$rating_bin)
CM
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction good bad
#>       good  806 480
#>       bad   696 558
#>                                         
#>                Accuracy : 0.537         
#>                  95% CI : (0.517, 0.557)
#>     No Information Rate : 0.591         
#>     P-Value [Acc > NIR] : 1             
#>                                         
#>                   Kappa : 0.072         
#>                                         
#>  Mcnemar's Test P-Value : 3.62e-10      
#>                                         
#>             Sensitivity : 0.537         
#>             Specificity : 0.538         
#>          Pos Pred Value : 0.627         
#>          Neg Pred Value : 0.445         
#>              Prevalence : 0.591         
#>          Detection Rate : 0.317         
#>    Detection Prevalence : 0.506         
#>       Balanced Accuracy : 0.537         
#>                                         
#>        'Positive' Class : good          
#> 
summary_table_down <- cbind(summary_table_down, RF_Nut = summary_table_fun(CM))
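The near-chance performance can be verified directly from the confusion-matrix counts above:

```r
#recompute the headline metrics from the counts, with "good" as the positive class
tp <- 806; fn <- 696; fp <- 480; tn <- 558
sens <- tp / (tp + fn)   # 806 / 1502
spec <- tn / (tn + fp)   # 558 / 1038
round(c(sensitivity = sens, specificity = spec,
        balanced_accuracy = (sens + spec) / 2), 3)
```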

5.4.2 Random Forest - balanced - NuBi

Hyperparameters

Best tune: mtry = 3

Accuracy metrics

  • ACC 54.8
  • BACC 54.2
  • Kappa 8.20
  • Recall 57.7

5.4.2.1 Fitting

set.seed(12)
rf_NuBi_fit <- caret::train(rating_bin ~ .,
                     data = train_bin_NuBi,
                     method = "rf",
                     trControl = trCtrl_down,
                     tuneLength = 10)
#show the random forest with cv 
rf_NuBi_fit
#> Random Forest 
#> 
#> 7623 samples
#>   18 predictor
#>    2 classes: 'good', 'bad' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold) 
#> Summary of sample sizes: 6099, 6099, 6098, 6098, 6098 
#> Addtional sampling using down-sampling
#> 
#> Resampling results across tuning parameters:
#> 
#>   mtry  Accuracy  Kappa  
#>    2    0.5512    0.09826
#>    3    0.5569    0.10237
#>    5    0.5490    0.09099
#>    7    0.5452    0.08598
#>    9    0.5472    0.08937
#>   10    0.5478    0.08876
#>   12    0.5482    0.09350
#>   14    0.5437    0.08312
#>   16    0.5403    0.07545
#>   18    0.5423    0.08035
#> 
#> Accuracy was used to select the optimal model using the
#>  largest value.
#> The final value used for the model was mtry = 3.
plot(rf_NuBi_fit, main = "Plot of CV Accuracy relative to Mtry parameter")

5.4.2.2 Predictions

set.seed(12)
#make predictions
rf_NuBi_pred <- predict(rf_NuBi_fit, newdata = test_bin_NuBi)

#confusion matrix 
CM <- confusionMatrix(data = rf_NuBi_pred, reference = test_bin_NuBi$rating_bin)
CM
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction good bad
#>       good  867 513
#>       bad   635 525
#>                                         
#>                Accuracy : 0.548         
#>                  95% CI : (0.528, 0.568)
#>     No Information Rate : 0.591         
#>     P-Value [Acc > NIR] : 0.999995      
#>                                         
#>                   Kappa : 0.082         
#>                                         
#>  Mcnemar's Test P-Value : 0.000355      
#>                                         
#>             Sensitivity : 0.577         
#>             Specificity : 0.506         
#>          Pos Pred Value : 0.628         
#>          Neg Pred Value : 0.453         
#>              Prevalence : 0.591         
#>          Detection Rate : 0.341         
#>    Detection Prevalence : 0.543         
#>       Balanced Accuracy : 0.542         
#>                                         
#>        'Positive' Class : good          
#> 
summary_table_down <- cbind(summary_table_down, RF_NuBi = summary_table_fun(CM))

5.4.3 Random Forest - balanced - NuTo

Hyperparameters

Best tune: mtry = 2

Accuracy metrics

  • ACC 56.4
  • BACC 55.8
  • Kappa 11.3
  • Recall 59.3

5.4.3.1 Fitting

set.seed(12)
rf_NuTo_fit <- caret::train(rating_bin ~ .,
                     data = train_bin_NuTo,
                     method = "rf",
                     trControl = trCtrl_down,
                     tuneLength = 10)
#show the random forest with cv 
rf_NuTo_fit
#> Random Forest 
#> 
#> 7623 samples
#>   18 predictor
#>    2 classes: 'good', 'bad' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold) 
#> Summary of sample sizes: 6099, 6099, 6098, 6098, 6098 
#> Addtional sampling using down-sampling
#> 
#> Resampling results across tuning parameters:
#> 
#>   mtry  Accuracy  Kappa  
#>    2    0.5586    0.10858
#>    3    0.5545    0.09957
#>    5    0.5515    0.09524
#>    7    0.5511    0.09567
#>    9    0.5498    0.09433
#>   10    0.5521    0.09823
#>   12    0.5546    0.10354
#>   14    0.5468    0.08888
#>   16    0.5504    0.09405
#>   18    0.5482    0.09194
#> 
#> Accuracy was used to select the optimal model using the
#>  largest value.
#> The final value used for the model was mtry = 2.
plot(rf_NuTo_fit, main = "Plot of CV Accuracy relative to Mtry parameter")

5.4.3.2 Predictions

set.seed(12)
#make predictions
rf_NuTo_pred <- predict(rf_NuTo_fit, newdata = test_bin_NuTo)

#confusion matrix 
CM <- confusionMatrix(data = rf_NuTo_pred, reference = test_bin_NuTo$rating_bin)
CM
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction good bad
#>       good  887 496
#>       bad   615 542
#>                                         
#>                Accuracy : 0.563         
#>                  95% CI : (0.543, 0.582)
#>     No Information Rate : 0.591         
#>     P-Value [Acc > NIR] : 0.9985        
#>                                         
#>                   Kappa : 0.111         
#>                                         
#>  Mcnemar's Test P-Value : 0.0004        
#>                                         
#>             Sensitivity : 0.591         
#>             Specificity : 0.522         
#>          Pos Pred Value : 0.641         
#>          Neg Pred Value : 0.468         
#>              Prevalence : 0.591         
#>          Detection Rate : 0.349         
#>    Detection Prevalence : 0.544         
#>       Balanced Accuracy : 0.556         
#>                                         
#>        'Positive' Class : good          
#> 
summary_table_down <- cbind(summary_table_down, RF_NuTo = summary_table_fun(CM))
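The headline metrics printed by confusionMatrix can be recovered by hand from the four cell counts. A base-R check using the NuTo test-set counts above (887, 496, 615, 542):

```r
# Rebuild the confusion matrix (rows = prediction, columns = reference)
tab <- matrix(c(887, 615, 496, 542), nrow = 2,
              dimnames = list(Prediction = c("good", "bad"),
                              Reference  = c("good", "bad")))
sens <- tab["good", "good"] / sum(tab[, "good"])  # sensitivity (recall): 0.591
spec <- tab["bad", "bad"]   / sum(tab[, "bad"])   # specificity: 0.522
acc  <- sum(diag(tab)) / sum(tab)                 # accuracy: 0.563
bacc <- (sens + spec) / 2                         # balanced accuracy: 0.556
# Cohen's kappa: observed accuracy corrected for chance agreement
pe    <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2
kappa <- (acc - pe) / (1 - pe)                    # 0.111
round(c(accuracy = acc, balanced = bacc, kappa = kappa), 3)
```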

5.4.4 Random Forest - balanced - Tot

Hyperparameters

Best tune: mtry = 3

Accuracy metrics

  • ACC 52.5
  • BACC 52.4
  • Kappa 4.7
  • Recall 52.9

5.4.4.1 Fitting

set.seed(12)
rf_Tot_fit <- caret::train(rating_bin ~ .,
                     data = train_bin_Tot,
                     method = "rf",
                     preProcess = NULL,
                     trControl = trCtrl_down,
                     tuneLength = 10)
#show the random forest with cv 
rf_Tot_fit
#> Random Forest 
#> 
#> 7623 samples
#>   14 predictor
#>    2 classes: 'good', 'bad' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold) 
#> Summary of sample sizes: 6099, 6099, 6098, 6098, 6098 
#> Addtional sampling using down-sampling
#> 
#> Resampling results across tuning parameters:
#> 
#>   mtry  Accuracy  Kappa  
#>    2    0.5334    0.06957
#>    3    0.5407    0.07430
#>    4    0.5352    0.06709
#>    6    0.5357    0.06395
#>    7    0.5309    0.05548
#>    8    0.5292    0.04827
#>   10    0.5291    0.05021
#>   11    0.5294    0.04991
#>   12    0.5283    0.04822
#>   14    0.5298    0.04841
#> 
#> Accuracy was used to select the optimal model using the
#>  largest value.
#> The final value used for the model was mtry = 3.
plot(rf_Tot_fit, main = "Plot of CV Accuracy relative to Mtry parameter")

5.4.4.2 Predictions

set.seed(12)
#make predictions
rf_Tot_pred <- predict(rf_Tot_fit, newdata = test_bin_Tot)

#confusion matrix 
CM <- confusionMatrix(data = rf_Tot_pred, reference = test_bin_Tot$rating_bin)
CM
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction good bad
#>       good  794 498
#>       bad   708 540
#>                                         
#>                Accuracy : 0.525         
#>                  95% CI : (0.506, 0.545)
#>     No Information Rate : 0.591         
#>     P-Value [Acc > NIR] : 1             
#>                                         
#>                   Kappa : 0.047         
#>                                         
#>  Mcnemar's Test P-Value : 1.76e-09      
#>                                         
#>             Sensitivity : 0.529         
#>             Specificity : 0.520         
#>          Pos Pred Value : 0.615         
#>          Neg Pred Value : 0.433         
#>              Prevalence : 0.591         
#>          Detection Rate : 0.313         
#>    Detection Prevalence : 0.509         
#>       Balanced Accuracy : 0.524         
#>                                         
#>        'Positive' Class : good          
#> 
#adding results to summary table
summary_table_down <- cbind(summary_table_down, RF_Tot = summary_table_fun(CM))
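summary_table_fun (defined earlier in the report) extracts one column of metrics per model; in a caret confusionMatrix object these numbers live in CM$overall and CM$byClass. A hypothetical mock mirroring the RF - Tot output above, with extract_metrics standing in for the real helper:

```r
# Mock of the relevant slots of a caret confusionMatrix object (RF - Tot run)
CM_mock <- list(overall = c(Accuracy = 0.525, Kappa = 0.047),
                byClass = c(Sensitivity = 0.529, "Balanced Accuracy" = 0.524))

# Hypothetical stand-in for summary_table_fun: pull one metrics column
extract_metrics <- function(cm) {
  c(ACC    = unname(cm$overall["Accuracy"]),
    BACC   = unname(cm$byClass["Balanced Accuracy"]),
    Kappa  = unname(cm$overall["Kappa"]),
    Recall = unname(cm$byClass["Sensitivity"]))
}
extract_metrics(CM_mock)
```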

5.4.5 Random Forest - balanced - Bin

Hyperparameters

Best tune: mtry = 2

Accuracy metrics

  • ACC 53.2
  • BACC 52.6
  • Kappa 5.1
  • Recall 55.7

5.4.5.1 Fitting

set.seed(12)
rf_Bin_fit <- caret::train(rating_bin ~ .,
                     data = train_bin_Bin,
                     method = "rf",
                     preProcess = NULL,
                     trControl = trCtrl_down,
                     tuneLength = 10)
#show the random forest with cv 
rf_Bin_fit
#> Random Forest 
#> 
#> 7623 samples
#>   14 predictor
#>    2 classes: 'good', 'bad' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold) 
#> Summary of sample sizes: 6099, 6099, 6098, 6098, 6098 
#> Addtional sampling using down-sampling
#> 
#> Resampling results across tuning parameters:
#> 
#>   mtry  Accuracy  Kappa  
#>    2    0.5384    0.07264
#>    3    0.5297    0.05937
#>    4    0.5309    0.06323
#>    6    0.5317    0.06704
#>    7    0.5277    0.05938
#>    8    0.5263    0.05836
#>   10    0.5251    0.05631
#>   11    0.5258    0.05787
#>   12    0.5331    0.06577
#>   14    0.5310    0.05715
#> 
#> Accuracy was used to select the optimal model using the
#>  largest value.
#> The final value used for the model was mtry = 2.
plot(rf_Bin_fit, main = "Plot of CV Accuracy relative to Mtry parameter")

5.4.5.2 Predictions

set.seed(12)
#make predictions
rf_Bin_pred <- predict(rf_Bin_fit, newdata = test_bin_Bin)

#confusion matrix 
CM <- confusionMatrix(data = rf_Bin_pred, reference = test_bin_Bin$rating_bin)
CM
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction good bad
#>       good  837 524
#>       bad   665 514
#>                                         
#>                Accuracy : 0.532         
#>                  95% CI : (0.512, 0.551)
#>     No Information Rate : 0.591         
#>     P-Value [Acc > NIR] : 1             
#>                                         
#>                   Kappa : 0.051         
#>                                         
#>  Mcnemar's Test P-Value : 4.91e-05      
#>                                         
#>             Sensitivity : 0.557         
#>             Specificity : 0.495         
#>          Pos Pred Value : 0.615         
#>          Neg Pred Value : 0.436         
#>              Prevalence : 0.591         
#>          Detection Rate : 0.330         
#>    Detection Prevalence : 0.536         
#>       Balanced Accuracy : 0.526         
#>                                         
#>        'Positive' Class : good          
#> 
#adding results to summary table
summary_table_down <- cbind(summary_table_down, RF_Bin = summary_table_fun(CM))

We see that the three best-performing models are Nut, NuBi and NuTo, which was expected given the importance of the nutritional values seen in the PCA during the EDA. We therefore test further models only with those three combinations of variables, leaving out models trained solely on Tot or on Bin.

5.4.6 Logistic Regression - balanced - Nut

5.4.6.1 Fitting

set.seed(12)
logr_Nut_blncd <- caret::train(rating_bin ~.,
                data = train_bin_Nut,
                method = "glm",
                family = "binomial",
                trControl = trCtrl_down)
logr_Nut_blncd
#> Generalized Linear Model 
#> 
#> 7623 samples
#>    4 predictor
#>    2 classes: 'good', 'bad' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold) 
#> Summary of sample sizes: 6099, 6099, 6098, 6098, 6098 
#> Addtional sampling using down-sampling
#> 
#> Resampling results:
#> 
#>   Accuracy  Kappa  
#>   0.4957    0.04683

We observe that the cross-validated accuracy on the training set is 49.6%.
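trCtrl_down requests down-sampling: inside each resample, caret randomly discards majority-class rows until both classes are equally frequent. This is why a balanced model can beat 50% balanced accuracy while its raw accuracy falls below the 59.1% no-information rate. A base-R sketch of the mechanism, using approximate class counts assumed from the 59.1% prevalence of the 7,623 training samples:

```r
set.seed(12)
# Approximate training-set class counts (assumed from 59.1% prevalence)
y <- factor(rep(c("good", "bad"), times = c(4505, 3118)))

# Down-sample: keep min(class size) randomly chosen rows from each class
n_min    <- min(table(y))
down_idx <- unlist(lapply(split(seq_along(y), y), sample, size = n_min))
table(y[down_idx])  # both classes now at 3118
```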

5.4.6.2 Predictions

logr_Nut_pred_blncd <- predict(logr_Nut_blncd, newdata = test_bin_Nut, type = "raw")

CM <- confusionMatrix(reference = test_bin_Nut$rating_bin, data = logr_Nut_pred_blncd, positive="good")
CM
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction good bad
#>       good  510 332
#>       bad   992 706
#>                                         
#>                Accuracy : 0.479         
#>                  95% CI : (0.459, 0.498)
#>     No Information Rate : 0.591         
#>     P-Value [Acc > NIR] : 1             
#>                                         
#>                   Kappa : 0.018         
#>                                         
#>  Mcnemar's Test P-Value : <2e-16        
#>                                         
#>             Sensitivity : 0.340         
#>             Specificity : 0.680         
#>          Pos Pred Value : 0.606         
#>          Neg Pred Value : 0.416         
#>              Prevalence : 0.591         
#>          Detection Rate : 0.201         
#>    Detection Prevalence : 0.331         
#>       Balanced Accuracy : 0.510         
#>                                         
#>        'Positive' Class : good          
#> 
summary_table_down <- cbind(summary_table_down, LogReg_Nut = summary_table_fun(CM))

Here we notice that the accuracy on the test set stands at 47.9%.
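With type = "raw", predict returns hard class labels; underneath, the glm produces probabilities through the inverse-logit link and the label corresponds to a 0.5 cutoff. A small illustration with hypothetical linear-predictor values:

```r
# plogis is the inverse logit: P(good) = 1 / (1 + exp(-eta))
eta  <- c(-1.2, 0.3, 0.8)                  # hypothetical linear predictors
prob <- plogis(eta)                        # class probabilities
pred <- ifelse(prob > 0.5, "good", "bad")  # 0.5 cutoff gives hard labels
pred  # "bad" "good" "good"
```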

5.4.7 Logistic Regression - balanced - NuBi

5.4.7.1 Fitting

set.seed(12)
logr_NuBi_blncd <- caret::train(rating_bin ~.,
                data = train_bin_NuBi,
                method = "glm",
                family = "binomial",
                trControl = trCtrl_down)
logr_NuBi_blncd
#> Generalized Linear Model 
#> 
#> 7623 samples
#>   18 predictor
#>    2 classes: 'good', 'bad' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold) 
#> Summary of sample sizes: 6099, 6099, 6098, 6098, 6098 
#> Addtional sampling using down-sampling
#> 
#> Resampling results:
#> 
#>   Accuracy  Kappa  
#>   0.539     0.07101

We observe that the accuracy of the training set stands at 53.9%.

5.4.7.2 Predictions

logr_NuBi_pred_blncd <- predict(logr_NuBi_blncd, newdata = test_bin_NuBi, type = "raw")

CM <- confusionMatrix(reference = test_bin_NuBi$rating_bin, data = logr_NuBi_pred_blncd, positive="good")
CM
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction good bad
#>       good  830 480
#>       bad   672 558
#>                                         
#>                Accuracy : 0.546         
#>                  95% CI : (0.527, 0.566)
#>     No Information Rate : 0.591         
#>     P-Value [Acc > NIR] : 1             
#>                                         
#>                   Kappa : 0.088         
#>                                         
#>  Mcnemar's Test P-Value : 1.83e-08      
#>                                         
#>             Sensitivity : 0.553         
#>             Specificity : 0.538         
#>          Pos Pred Value : 0.634         
#>          Neg Pred Value : 0.454         
#>              Prevalence : 0.591         
#>          Detection Rate : 0.327         
#>    Detection Prevalence : 0.516         
#>       Balanced Accuracy : 0.545         
#>                                         
#>        'Positive' Class : good          
#> 
#adding results to summary table
summary_table_down <- cbind(summary_table_down, LogReg_NuBi = summary_table_fun(CM))

To evaluate the quality of the model, we note that the accuracy on the test set stands at 54.6%.
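The No Information Rate in these tables is the accuracy of always predicting the majority class (the 0.591 prevalence of good), and P-Value [Acc > NIR] is, as caret documents it, a one-sided binomial test of the observed accuracy against that rate. Recomputing it from the NuBi counts above:

```r
n    <- 830 + 480 + 672 + 558  # test-set size from the matrix above (2540)
hits <- 830 + 558              # correct predictions (1388)
hits / n                       # accuracy ~0.546, below the NIR of 0.591
# One-sided binomial test of accuracy against the no-information rate
binom.test(hits, n, p = 0.591, alternative = "greater")$p.value  # ~1
```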

5.4.8 Logistic Regression - balanced - NuTo

5.4.8.1 Fitting

set.seed(12)
logr_NuTo_blncd <- caret::train(rating_bin ~.,
                data = train_bin_NuTo,
                method = "glm",
                family = "binomial",
                trControl = trCtrl_down)
logr_NuTo_blncd
#> Generalized Linear Model 
#> 
#> 7623 samples
#>   18 predictor
#>    2 classes: 'good', 'bad' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold) 
#> Summary of sample sizes: 6099, 6099, 6098, 6098, 6098 
#> Addtional sampling using down-sampling
#> 
#> Resampling results:
#> 
#>   Accuracy  Kappa  
#>   0.5409    0.07679

We observe that the accuracy of the training set stands at 54.1%.

5.4.8.2 Predictions

logr_NuTo_pred_blncd <- predict(logr_NuTo_blncd, newdata = test_bin_NuTo, type = "raw")

CM <- confusionMatrix(reference = test_bin_NuTo$rating_bin, data = logr_NuTo_pred_blncd, positive="good")
CM
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction good bad
#>       good  859 472
#>       bad   643 566
#>                                        
#>                Accuracy : 0.561        
#>                  95% CI : (0.541, 0.58)
#>     No Information Rate : 0.591        
#>     P-Value [Acc > NIR] : 0.999        
#>                                        
#>                   Kappa : 0.114        
#>                                        
#>  Mcnemar's Test P-Value : 3.56e-07     
#>                                        
#>             Sensitivity : 0.572        
#>             Specificity : 0.545        
#>          Pos Pred Value : 0.645        
#>          Neg Pred Value : 0.468        
#>              Prevalence : 0.591        
#>          Detection Rate : 0.338        
#>    Detection Prevalence : 0.524        
#>       Balanced Accuracy : 0.559        
#>                                        
#>        'Positive' Class : good         
#> 
#adding results to summary table
summary_table_down <- cbind(summary_table_down, LogReg_NuTo = summary_table_fun(CM))

To evaluate the quality of the model, we note that the accuracy on the test set stands at 56.1%.
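Mcnemar's Test P-Value looks only at the two disagreement cells: 643 recipes predicted bad but actually good against 472 the other way round. The tiny p-value says the two error types are asymmetric, not that the model is accurate. Recomputed from the NuTo matrix above:

```r
# Rows = prediction, columns = reference, as printed by confusionMatrix
tab <- matrix(c(859, 643, 472, 566), nrow = 2)
# McNemar's test compares the off-diagonal (discordant) cells, 643 vs 472
mcnemar.test(tab)$p.value  # ~3.6e-07, matching the output above
```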

5.4.9 SVM - balanced - Nut

5.4.9.1 Linear SVM - balanced - Nut

svm_Linear_nutritional <- train(rating_bin ~ calories + protein + fat + sodium, data=train_bin_Nut, method = "svmLinear", trControl=trCtrl_down)
svm_Linear_nutritional

The validation accuracy stands at 42.8%, which is unsatisfactory given that it is a cross-validated estimate on the training set. The next step consists in creating a grid of candidate cost values and passing it to the tuneGrid argument.

5.4.9.1.1 Tuning the Hyperparameters - Linear basis SVM
grid <- expand.grid(C = c(0.01, 0.1, 1, 10, 100, 1000))
grid
#>       C
#> 1 1e-02
#> 2 1e-01
#> 3 1e+00
#> 4 1e+01
#> 5 1e+02
#> 6 1e+03
svm_Linear_Grid_nutritional <- train(rating_bin ~., data=train_bin_Nut, method = "svmLinear", trControl=trCtrl_down, tuneGrid = grid)
svm_Linear_Grid_nutritional
#> Support Vector Machines with Linear Kernel 
#> 
#> 7623 samples
#>    4 predictor
#>    2 classes: 'good', 'bad' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold) 
#> Summary of sample sizes: 6098, 6098, 6098, 6099, 6099 
#> Addtional sampling using down-sampling
#> 
#> Resampling results across tuning parameters:
#> 
#>   C      Accuracy  Kappa  
#>   1e-02  0.4274    0.01589
#>   1e-01  0.4270    0.01451
#>   1e+00  0.4277    0.01512
#>   1e+01  0.4269    0.01484
#>   1e+02  0.4294    0.01739
#>   1e+03  0.4300    0.01765
#> 
#> Accuracy was used to select the optimal model using the
#>  largest value.
#> The final value used for the model was C = 1000.
plot(svm_Linear_Grid_nutritional)

svm_Linear_Grid_nutritional$bestTune
#>      C
#> 6 1000

The result indicates that setting the cost to C=1000 provides the best model, with accuracy=43.0%. The accuracy apparently plateaus at this value, with no meaningful improvement over the previous linear SVM with its default cost C=1.

5.4.9.2 Radial SVM - balanced - Nut

svm_Radial_nutritional <- train(rating_bin ~., data=train_bin_Nut, method = "svmRadial", trControl=trCtrl_down)
svm_Radial_nutritional

The validation accuracy stands at 54.8%, which is substantially better than the SVM model with the linear kernel. The next step consists in creating a grid of candidate values for the cost and sigma, passed via the tuneGrid argument.
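caret's svmRadial wraps kernlab, whose radial basis kernel is parameterised as K(x, y) = exp(-sigma * ||x - y||^2): a small sigma gives a wide, smooth kernel, a large sigma a very local one, which is what the sigma axis of the tuning grid explores. A minimal sketch of the kernel itself:

```r
# kernlab-style RBF kernel value between two points
rbf <- function(x, y, sigma) exp(-sigma * sum((x - y)^2))
rbf(c(0, 0), c(1, 1), sigma = 0.01)  # ~0.98: the points look very similar
rbf(c(0, 0), c(1, 1), sigma = 1)     # ~0.14: similarity decays much faster
```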

5.4.9.2.1 Tuning the Hyperparameters - Radial basis SVM
grid_radial <- expand.grid(sigma = c(0.01, 0.02, 0.05, 0.1),
                           C = c(1, 10, 100, 500, 1000))
grid_radial
#>    sigma    C
#> 1   0.01    1
#> 2   0.02    1
#> 3   0.05    1
#> 4   0.10    1
#> 5   0.01   10
#> 6   0.02   10
#> 7   0.05   10
#> 8   0.10   10
#> 9   0.01  100
#> 10  0.02  100
#> 11  0.05  100
#> 12  0.10  100
#> 13  0.01  500
#> 14  0.02  500
#> 15  0.05  500
#> 16  0.10  500
#> 17  0.01 1000
#> 18  0.02 1000
#> 19  0.05 1000
#> 20  0.10 1000
svm_Radial_Grid_nutritional <- train(rating_bin ~., data=train_bin_Nut, method = "svmRadial", trControl=trCtrl_down, tuneGrid = grid_radial)
svm_Radial_Grid_nutritional
#> Support Vector Machines with Radial Basis Function Kernel 
#> 
#> 7623 samples
#>    4 predictor
#>    2 classes: 'good', 'bad' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold) 
#> Summary of sample sizes: 6098, 6098, 6098, 6099, 6099 
#> Addtional sampling using down-sampling
#> 
#> Resampling results across tuning parameters:
#> 
#>   sigma  C     Accuracy  Kappa  
#>   0.01      1  0.4639    0.02754
#>   0.01     10  0.4795    0.03602
#>   0.01    100  0.4837    0.04114
#>   0.01    500  0.4973    0.05518
#>   0.01   1000  0.5106    0.06371
#>   0.02      1  0.4695    0.03141
#>   0.02     10  0.4812    0.03930
#>   0.02    100  0.5039    0.05635
#>   0.02    500  0.5100    0.06460
#>   0.02   1000  0.5132    0.06045
#>   0.05      1  0.4859    0.03997
#>   0.05     10  0.5016    0.05040
#>   0.05    100  0.5203    0.07105
#>   0.05    500  0.5154    0.06995
#>   0.05   1000  0.5259    0.08091
#>   0.10      1  0.4925    0.04567
#>   0.10     10  0.5144    0.06570
#>   0.10    100  0.5176    0.07403
#>   0.10    500  0.5226    0.07953
#>   0.10   1000  0.5264    0.07750
#> 
#> Accuracy was used to select the optimal model using the
#>  largest value.
#> The final values used for the model were sigma = 0.1 and C = 1000.
plot(svm_Radial_Grid_nutritional)

svm_Radial_Grid_nutritional$bestTune
#>    sigma    C
#> 20   0.1 1000

The optimal model from this search uses sigma = 0.1 and C = 1000, reaching accuracy=52.6%.

5.4.9.3 SVM Nut - Radial Kernel Best model

# After finding the best hyperparameters, we re-train the model with the best hyperparameters on the entire training set. Afterwards we will evaluate the model on the test set.

grid_radial_best <- expand.grid(sigma = 0.1, C = 1000)


recipe_rb_tuned_nutritional <- train(rating_bin ~., data=train_bin_Nut, method = "svmRadial", trControl=trCtrl_down, tuneGrid = grid_radial_best)
recipe_rb_tuned_pred_nutritional <- predict(recipe_rb_tuned_nutritional, newdata = test_bin_Nut)

CM <- confusionMatrix(data=recipe_rb_tuned_pred_nutritional, reference = test_bin_Nut$rating_bin)
CM
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction good bad
#>       good  671 372
#>       bad   831 666
#>                                         
#>                Accuracy : 0.526         
#>                  95% CI : (0.507, 0.546)
#>     No Information Rate : 0.591         
#>     P-Value [Acc > NIR] : 1             
#>                                         
#>                   Kappa : 0.083         
#>                                         
#>  Mcnemar's Test P-Value : <2e-16        
#>                                         
#>             Sensitivity : 0.447         
#>             Specificity : 0.642         
#>          Pos Pred Value : 0.643         
#>          Neg Pred Value : 0.445         
#>              Prevalence : 0.591         
#>          Detection Rate : 0.264         
#>    Detection Prevalence : 0.411         
#>       Balanced Accuracy : 0.544         
#>                                         
#>        'Positive' Class : good          
#> 
#adding results to summary table
summary_table_down <- cbind(summary_table_down, SVM_Radial_Nut = summary_table_fun(CM))

The result indicates that, with the tuned hyperparameters, the radial basis SVM achieves an accuracy of 52.6% on the test set. We can conclude that, among the SVM models on the Nut variables, the radial basis kernel with C=1000 and sigma=0.1 is the best.

5.4.10 SVM - balanced - NuBi

5.4.10.1 Linear SVM - balanced - NuBi

svm_Linear_NuBi <- train(rating_bin ~., data=train_bin_NuBi, method = "svmLinear", trControl=trCtrl_down)
svm_Linear_NuBi

The validation accuracy stands at 56.1%. The next step consists in creating a grid of candidate cost values and passing it to the tuneGrid argument.

5.4.10.1.1 Tuning the Hyperparameters - Linear basis SVM
grid <- expand.grid(C = c(0.01, 0.1, 1, 10, 100, 1000))
grid
#>       C
#> 1 1e-02
#> 2 1e-01
#> 3 1e+00
#> 4 1e+01
#> 5 1e+02
#> 6 1e+03
svm_Linear_Grid_NuBi <- train(rating_bin ~., data=train_bin_NuBi, method = "svmLinear", trControl=trCtrl_down, tuneGrid = grid)
svm_Linear_Grid_NuBi
#> Support Vector Machines with Linear Kernel 
#> 
#> 7623 samples
#>   18 predictor
#>    2 classes: 'good', 'bad' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold) 
#> Summary of sample sizes: 6098, 6098, 6098, 6099, 6099 
#> Addtional sampling using down-sampling
#> 
#> Resampling results across tuning parameters:
#> 
#>   C      Accuracy  Kappa  
#>   1e-02  0.5605    0.05415
#>   1e-01  0.5658    0.06242
#>   1e+00  0.5578    0.05575
#>   1e+01  0.5591    0.05410
#>   1e+02  0.5702    0.05523
#>   1e+03  0.5595    0.05896
#> 
#> Accuracy was used to select the optimal model using the
#>  largest value.
#> The final value used for the model was C = 100.
plot(svm_Linear_Grid_NuBi)

svm_Linear_Grid_NuBi$bestTune
#>     C
#> 5 100

The result indicates that setting the cost to C=100 provides the best model, with accuracy=57.0%. This is a slight improvement over the previous linear SVM with its default cost C=1.
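The cost C sets the soft-margin trade-off: the linear SVM minimises 0.5 * ||w||^2 + C * sum of hinge losses, so a large C punishes margin violations harder (risking overfitting) while a small C tolerates them for a wider margin. A toy hinge-loss computation:

```r
# Hinge loss max(0, 1 - y * f(x)) for labels y in {-1, +1}, decision values f
hinge <- function(y, f) pmax(0, 1 - y * f)
hinge(y = c(1, 1, -1), f = c(2.0, 0.3, -0.1))  # 0.0 0.7 0.9
```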

5.4.10.2 Radial SVM - balanced - NuBi

svm_Radial_NuBi <- train(rating_bin ~., data=train_bin_NuBi, method = "svmRadial", trControl=trCtrl_down)
svm_Radial_NuBi

The validation accuracy stands at 53.7%, which is worse than the SVM model with the linear kernel. The next step consists in creating a grid of candidate values for the cost and sigma, passed via the tuneGrid argument.

5.4.10.2.1 Tuning the Hyperparameters - Radial basis SVM
grid_radial <- expand.grid(sigma = c(0.01, 0.02, 0.05, 0.1),
                           C = c(1, 10, 100, 500, 1000))
grid_radial
#>    sigma    C
#> 1   0.01    1
#> 2   0.02    1
#> 3   0.05    1
#> 4   0.10    1
#> 5   0.01   10
#> 6   0.02   10
#> 7   0.05   10
#> 8   0.10   10
#> 9   0.01  100
#> 10  0.02  100
#> 11  0.05  100
#> 12  0.10  100
#> 13  0.01  500
#> 14  0.02  500
#> 15  0.05  500
#> 16  0.10  500
#> 17  0.01 1000
#> 18  0.02 1000
#> 19  0.05 1000
#> 20  0.10 1000
svm_Radial_Grid_NuBi <- train(rating_bin ~., data=train_bin_NuBi, method = "svmRadial", trControl=trCtrl_down, tuneGrid = grid_radial)
svm_Radial_Grid_NuBi
#> Support Vector Machines with Radial Basis Function Kernel 
#> 
#> 7623 samples
#>   18 predictor
#>    2 classes: 'good', 'bad' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold) 
#> Summary of sample sizes: 6098, 6098, 6099, 6098, 6099 
#> Addtional sampling using down-sampling
#> 
#> Resampling results across tuning parameters:
#> 
#>   sigma  C     Accuracy  Kappa  
#>   0.01      1  0.5436    0.07849
#>   0.01     10  0.5241    0.06469
#>   0.01    100  0.5280    0.06961
#>   0.01    500  0.5221    0.05656
#>   0.01   1000  0.5253    0.06514
#>   0.02      1  0.5355    0.06911
#>   0.02     10  0.5229    0.06411
#>   0.02    100  0.5170    0.04806
#>   0.02    500  0.5242    0.05930
#>   0.02   1000  0.5288    0.06555
#>   0.05      1  0.5298    0.07200
#>   0.05     10  0.5228    0.05749
#>   0.05    100  0.5302    0.06464
#>   0.05    500  0.5263    0.05616
#>   0.05   1000  0.5233    0.04928
#>   0.10      1  0.5249    0.05581
#>   0.10     10  0.5246    0.05390
#>   0.10    100  0.5212    0.04840
#>   0.10    500  0.5275    0.06150
#>   0.10   1000  0.5232    0.05671
#> 
#> Accuracy was used to select the optimal model using the
#>  largest value.
#> The final values used for the model were sigma = 0.01 and C = 1.
plot(svm_Radial_Grid_NuBi)

svm_Radial_Grid_NuBi$bestTune
#>   sigma C
#> 1  0.01 1

The optimal model from this search uses sigma = 0.01 and C = 1, reaching accuracy=54.4%.

5.4.10.3 SVM NuBi - Radial Kernel Best model

# After finding the best hyperparameters, we re-train the model with the best hyperparameters on the entire training set. Afterwards we will evaluate the model on the test set.

grid_radial_best <- expand.grid(sigma = 0.01, C = 1)


recipe_rb_tuned_NuBi <- train(rating_bin ~., data=train_bin_NuBi, method = "svmRadial", trControl=trCtrl_down, tuneGrid = grid_radial_best)
recipe_rb_tuned_pred_NuBi <- predict(recipe_rb_tuned_NuBi, newdata = test_bin_NuBi)

CM <- confusionMatrix(data=recipe_rb_tuned_pred_NuBi, reference = test_bin_NuBi$rating_bin)
CM
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction good bad
#>       good  815 501
#>       bad   687 537
#>                                         
#>                Accuracy : 0.532         
#>                  95% CI : (0.513, 0.552)
#>     No Information Rate : 0.591         
#>     P-Value [Acc > NIR] : 1             
#>                                         
#>                   Kappa : 0.058         
#>                                         
#>  Mcnemar's Test P-Value : 7.99e-08      
#>                                         
#>             Sensitivity : 0.543         
#>             Specificity : 0.517         
#>          Pos Pred Value : 0.619         
#>          Neg Pred Value : 0.439         
#>              Prevalence : 0.591         
#>          Detection Rate : 0.321         
#>    Detection Prevalence : 0.518         
#>       Balanced Accuracy : 0.530         
#>                                         
#>        'Positive' Class : good          
#> 
#adding results to summary table
summary_table_down <- cbind(summary_table_down, SVM_Radial_NuBi = summary_table_fun(CM))

The result indicates that, with the tuned hyperparameters, the radial basis SVM achieves an accuracy of 53.2% on the test set. For the NuBi variables, the radial basis kernel with C=1 and sigma=0.01 therefore does not improve on the linear kernel.

5.4.11 SVM - balanced - NuTo

5.4.11.1 Linear SVM - balanced - NuTo

svm_Linear_NuTo <- train(rating_bin ~ calories + protein + fat + sodium, data=train_bin_NuTo, method = "svmLinear", trControl=trCtrl_down)
svm_Linear_NuTo

The validation accuracy stands at 42.8%, which is unsatisfactory given that it is a cross-validated estimate on the training set. The next step consists in creating a grid of candidate cost values and passing it to the tuneGrid argument.

5.4.11.1.1 Tuning the Hyperparameters - Linear basis SVM
grid <- expand.grid(C = c(0.01, 0.1, 1, 10, 100, 1000))
grid
#>       C
#> 1 1e-02
#> 2 1e-01
#> 3 1e+00
#> 4 1e+01
#> 5 1e+02
#> 6 1e+03
svm_Linear_Grid_NuTo <- train(rating_bin ~., data=train_bin_NuTo, method = "svmLinear", trControl=trCtrl_down, tuneGrid = grid)
svm_Linear_Grid_NuTo
#> Support Vector Machines with Linear Kernel 
#> 
#> 7623 samples
#>   18 predictor
#>    2 classes: 'good', 'bad' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold) 
#> Summary of sample sizes: 6098, 6099, 6098, 6099, 6098 
#> Addtional sampling using down-sampling
#> 
#> Resampling results across tuning parameters:
#> 
#>   C      Accuracy  Kappa  
#>   1e-02  0.5636    0.06680
#>   1e-01  0.5502    0.06580
#>   1e+00  0.5226    0.04705
#>   1e+01  0.5541    0.05594
#>   1e+02  0.5626    0.06101
#>   1e+03  0.5375    0.03216
#> 
#> Accuracy was used to select the optimal model using the
#>  largest value.
#> The final value used for the model was C = 0.01.
plot(svm_Linear_Grid_NuTo)

svm_Linear_Grid_NuTo$bestTune
#>      C
#> 1 0.01

The result indicates that setting the cost to C=0.01 provides the best model, with accuracy=56.4%. This is a considerable improvement over the linear SVM with default cost C=1 (52.3% in the grid above).

5.4.11.2 Radial SVM - balanced - NuTo

svm_Radial_NuTo <- train(rating_bin ~., data=train_bin_NuTo, method = "svmRadial", trControl=trCtrl_down)
svm_Radial_NuTo

The validation accuracy stands at 53.6%, which is substantially better than the linear SVM with default parameters. The next step consists in creating a grid of candidate values for the cost and sigma, passed via the tuneGrid argument.

5.4.11.2.1 Tuning the Hyperparameters - Radial basis SVM
grid_radial <- expand.grid(sigma = c(0.01, 0.02, 0.05, 0.1),
                           C = c(1, 10, 100, 500, 1000))
grid_radial
#>    sigma    C
#> 1   0.01    1
#> 2   0.02    1
#> 3   0.05    1
#> 4   0.10    1
#> 5   0.01   10
#> 6   0.02   10
#> 7   0.05   10
#> 8   0.10   10
#> 9   0.01  100
#> 10  0.02  100
#> 11  0.05  100
#> 12  0.10  100
#> 13  0.01  500
#> 14  0.02  500
#> 15  0.05  500
#> 16  0.10  500
#> 17  0.01 1000
#> 18  0.02 1000
#> 19  0.05 1000
#> 20  0.10 1000
svm_Radial_Grid_NuTo <- train(rating_bin ~., data=train_bin_NuTo, method = "svmRadial", trControl=trCtrl_down, tuneGrid = grid_radial)
svm_Radial_Grid_NuTo
#> Support Vector Machines with Radial Basis Function Kernel 
#> 
#> 7623 samples
#>   18 predictor
#>    2 classes: 'good', 'bad' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold) 
#> Summary of sample sizes: 6098, 6099, 6099, 6098, 6098 
#> Addtional sampling using down-sampling
#> 
#> Resampling results across tuning parameters:
#> 
#>   sigma  C     Accuracy  Kappa  
#>   0.01      1  0.5535    0.08529
#>   0.01     10  0.5403    0.09157
#>   0.01    100  0.5327    0.08438
#>   0.01    500  0.5279    0.07059
#>   0.01   1000  0.5291    0.06973
#>   0.02      1  0.5376    0.07964
#>   0.02     10  0.5347    0.08873
#>   0.02    100  0.5284    0.07144
#>   0.02    500  0.5296    0.07168
#>   0.02   1000  0.5312    0.06884
#>   0.05      1  0.5315    0.07515
#>   0.05     10  0.5262    0.06502
#>   0.05    100  0.5346    0.07739
#>   0.05    500  0.5372    0.08024
#>   0.05   1000  0.5266    0.06042
#>   0.10      1  0.5347    0.07330
#>   0.10     10  0.5355    0.07451
#>   0.10    100  0.5334    0.07182
#>   0.10    500  0.5254    0.05749
#>   0.10   1000  0.5190    0.04579
#> 
#> Accuracy was used to select the optimal model using the
#>  largest value.
#> The final values used for the model were sigma = 0.01 and C = 1.
plot(svm_Radial_Grid_NuTo)

svm_Radial_Grid_NuTo$bestTune
#>   sigma C
#> 1  0.01 1

The optimal model from this search uses sigma = 0.01 and C = 1, reaching accuracy=55.4%.

5.4.11.3 SVM NuTo - Radial Kernel Best model

# After finding the best hyperparameters, we re-train the model with the best hyperparameters on the entire training set. Afterwards we will evaluate the model on the test set.

grid_radial_best <- expand.grid(sigma = 0.01, C = 1)


recipe_rb_tuned_NuTo <- train(rating_bin ~., data=train_bin_NuTo, method = "svmRadial", trControl=trCtrl_down, tuneGrid = grid_radial_best)
recipe_rb_tuned_pred_NuTo <- predict(recipe_rb_tuned_NuTo, newdata = test_bin_NuTo)

CM <- confusionMatrix(data=recipe_rb_tuned_pred_NuTo, reference = test_bin_NuTo$rating_bin)
CM
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction good bad
#>       good  821 488
#>       bad   681 550
#>                                        
#>                Accuracy : 0.54         
#>                  95% CI : (0.52, 0.559)
#>     No Information Rate : 0.591        
#>     P-Value [Acc > NIR] : 1            
#>                                        
#>                   Kappa : 0.074        
#>                                        
#>  Mcnemar's Test P-Value : 1.96e-08     
#>                                        
#>             Sensitivity : 0.547        
#>             Specificity : 0.530        
#>          Pos Pred Value : 0.627        
#>          Neg Pred Value : 0.447        
#>              Prevalence : 0.591        
#>          Detection Rate : 0.323        
#>    Detection Prevalence : 0.515        
#>       Balanced Accuracy : 0.538        
#>                                        
#>        'Positive' Class : good         
#> 
#adding results to summary table
summary_table_down <- cbind(summary_table_down, SVM_Radial_NuTo = summary_table_fun(CM))

The result indicates that, with the tuned hyperparameters, the radial basis SVM achieves an accuracy of 54.0% on the test set. For the NuTo variables, the radial basis kernel with C=1 and sigma=0.01 is therefore comparable to, but not clearly better than, the tuned linear kernel.

5.4.12 KNN - balanced - Nut

5.4.12.1 Fitting

set.seed(12)
knn_Nut <- caret::train(rating_bin ~.,
                data = train_bin_Nut,
                method = "knn",
                trControl = trCtrl_down,
                tuneGrid = data.frame(k = seq(1, 150, by = 5)) # initially checked on 1-151 range; best was 111
                )
knn_Nut
#> k-Nearest Neighbors 
#> 
#> 7623 samples
#>    4 predictor
#>    2 classes: 'good', 'bad' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold) 
#> Summary of sample sizes: 6099, 6099, 6098, 6098, 6098 
#> Addtional sampling using down-sampling
#> 
#> Resampling results across tuning parameters:
#> 
#>   k    Accuracy  Kappa  
#>     1  0.5115    0.02034
#>     6  0.5127    0.02523
#>    11  0.5254    0.04719
#>    16  0.5205    0.03622
#>    21  0.5296    0.05524
#>    26  0.5285    0.05244
#>    31  0.5356    0.06752
#>    36  0.5327    0.05924
#>    41  0.5396    0.07569
#>    46  0.5397    0.07476
#>    51  0.5406    0.07732
#>    56  0.5367    0.07057
#>    61  0.5356    0.06750
#>    66  0.5422    0.08001
#>    71  0.5378    0.07045
#>    76  0.5405    0.07700
#>    81  0.5399    0.07641
#>    86  0.5399    0.07663
#>    91  0.5493    0.09493
#>    96  0.5441    0.08423
#>   101  0.5393    0.07589
#>   106  0.5415    0.07986
#>   111  0.5466    0.08803
#>   116  0.5401    0.07315
#>   121  0.5434    0.07961
#>   126  0.5426    0.07906
#>   131  0.5486    0.09043
#>   136  0.5508    0.09346
#>   141  0.5462    0.08639
#>   146  0.5511    0.09357
#> 
#> Accuracy was used to select the optimal model using the
#>  largest value.
#> The final value used for the model was k = 146.

5.4.12.2 Predictions

knn_Nut_pred <- predict(knn_Nut, newdata = test_bin_Nut, type = "raw")

CM <- confusionMatrix(reference = test_bin_Nut$rating_bin, data = knn_Nut_pred)
CM
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction good bad
#>       good  842 507
#>       bad   660 531
#>                                        
#>                Accuracy : 0.541        
#>                  95% CI : (0.521, 0.56)
#>     No Information Rate : 0.591        
#>     P-Value [Acc > NIR] : 1            
#>                                        
#>                   Kappa : 0.071        
#>                                        
#>  Mcnemar's Test P-Value : 8.61e-06     
#>                                        
#>             Sensitivity : 0.561        
#>             Specificity : 0.512        
#>          Pos Pred Value : 0.624        
#>          Neg Pred Value : 0.446        
#>              Prevalence : 0.591        
#>          Detection Rate : 0.331        
#>    Detection Prevalence : 0.531        
#>       Balanced Accuracy : 0.536        
#>                                        
#>        'Positive' Class : good         
#> 
summary_table_down <- cbind(summary_table_down, KNN_Nut = summary_table_fun(CM))

5.4.13 KNN - balanced - NuBi

5.4.13.1 Fitting

set.seed(12)
knn_NuBi <- caret::train(rating_bin ~.,
                data = train_bin_NuBi,
                method = "knn",
                trControl = trCtrl_down,
                tuneGrid= data.frame(k = seq(1, 150,by = 5))#initially checked on 1-151 range, best was 37
                )
knn_NuBi
#> k-Nearest Neighbors 
#> 
#> 7623 samples
#>   18 predictor
#>    2 classes: 'good', 'bad' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold) 
#> Summary of sample sizes: 6099, 6099, 6098, 6098, 6098 
#> Addtional sampling using down-sampling
#> 
#> Resampling results across tuning parameters:
#> 
#>   k    Accuracy  Kappa  
#>     1  0.5159    0.03302
#>     6  0.5195    0.03762
#>    11  0.5175    0.03123
#>    16  0.5281    0.05469
#>    21  0.5291    0.05654
#>    26  0.5207    0.04282
#>    31  0.5249    0.05307
#>    36  0.5289    0.06206
#>    41  0.5199    0.04767
#>    46  0.5245    0.05251
#>    51  0.5239    0.05521
#>    56  0.5209    0.04855
#>    61  0.5347    0.07109
#>    66  0.5264    0.05810
#>    71  0.5371    0.07523
#>    76  0.5229    0.05468
#>    81  0.5293    0.06244
#>    86  0.5294    0.06422
#>    91  0.5315    0.07016
#>    96  0.5291    0.06254
#>   101  0.5287    0.06560
#>   106  0.5326    0.07371
#>   111  0.5293    0.06799
#>   116  0.5250    0.06135
#>   121  0.5294    0.07004
#>   126  0.5289    0.06792
#>   131  0.5305    0.07441
#>   136  0.5256    0.06340
#>   141  0.5287    0.06842
#>   146  0.5287    0.07384
#> 
#> Accuracy was used to select the optimal model using the
#>  largest value.
#> The final value used for the model was k = 71.

5.4.13.2 Predictions

knn_NuBi_pred <- predict(knn_NuBi, newdata = test_bin_NuBi, type = "raw")

CM <- confusionMatrix(reference = test_bin_NuBi$rating_bin, data = knn_NuBi_pred, positive="good")
CM
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction good bad
#>       good  770 496
#>       bad   732 542
#>                                         
#>                Accuracy : 0.517         
#>                  95% CI : (0.497, 0.536)
#>     No Information Rate : 0.591         
#>     P-Value [Acc > NIR] : 1             
#>                                         
#>                   Kappa : 0.034         
#>                                         
#>  Mcnemar's Test P-Value : 2e-11         
#>                                         
#>             Sensitivity : 0.513         
#>             Specificity : 0.522         
#>          Pos Pred Value : 0.608         
#>          Neg Pred Value : 0.425         
#>              Prevalence : 0.591         
#>          Detection Rate : 0.303         
#>    Detection Prevalence : 0.498         
#>       Balanced Accuracy : 0.517         
#>                                         
#>        'Positive' Class : good          
#> 
summary_table_down <- cbind(summary_table_down, KNN_NuBi = summary_table_fun(CM))

5.4.14 KNN - balanced - NuTo

5.4.14.1 Fitting

set.seed(12)
knn_NuTo <- caret::train(rating_bin ~.,
                data = train_bin_NuTo,
                method = "knn",
                trControl = trCtrl_down,
                tuneGrid= data.frame(k = seq(1, 150,by = 5))
                )
knn_NuTo
#> k-Nearest Neighbors 
#> 
#> 7623 samples
#>   18 predictor
#>    2 classes: 'good', 'bad' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold) 
#> Summary of sample sizes: 6099, 6099, 6098, 6098, 6098 
#> Addtional sampling using down-sampling
#> 
#> Resampling results across tuning parameters:
#> 
#>   k    Accuracy  Kappa  
#>     1  0.5246    0.04713
#>     6  0.5209    0.04099
#>    11  0.5235    0.04972
#>    16  0.5249    0.04998
#>    21  0.5292    0.06070
#>    26  0.5268    0.06144
#>    31  0.5310    0.07171
#>    36  0.5250    0.05884
#>    41  0.5243    0.05935
#>    46  0.5241    0.05879
#>    51  0.5205    0.05718
#>    56  0.5287    0.07129
#>    61  0.5216    0.05783
#>    66  0.5224    0.05861
#>    71  0.5263    0.06355
#>    76  0.5277    0.06930
#>    81  0.5259    0.06420
#>    86  0.5297    0.06996
#>    91  0.5259    0.06647
#>    96  0.5310    0.07598
#>   101  0.5317    0.07811
#>   106  0.5293    0.07052
#>   111  0.5262    0.06538
#>   116  0.5272    0.06658
#>   121  0.5321    0.07644
#>   126  0.5301    0.07321
#>   131  0.5273    0.07247
#>   136  0.5371    0.08872
#>   141  0.5298    0.07499
#>   146  0.5327    0.07970
#> 
#> Accuracy was used to select the optimal model using the
#>  largest value.
#> The final value used for the model was k = 136.

5.4.14.2 Predictions

knn_NuTo_pred <- predict(knn_NuTo, newdata = test_bin_NuTo, type = "raw")

CM <- confusionMatrix(reference = test_bin_NuTo$rating_bin, data = knn_NuTo_pred, positive="good")
CM
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction good bad
#>       good  707 424
#>       bad   795 614
#>                                      
#>                Accuracy : 0.52       
#>                  95% CI : (0.5, 0.54)
#>     No Information Rate : 0.591      
#>     P-Value [Acc > NIR] : 1          
#>                                      
#>                   Kappa : 0.059      
#>                                      
#>  Mcnemar's Test P-Value : <2e-16     
#>                                      
#>             Sensitivity : 0.471      
#>             Specificity : 0.592      
#>          Pos Pred Value : 0.625      
#>          Neg Pred Value : 0.436      
#>              Prevalence : 0.591      
#>          Detection Rate : 0.278      
#>    Detection Prevalence : 0.445      
#>       Balanced Accuracy : 0.531      
#>                                      
#>        'Positive' Class : good       
#> 
summary_table_down <- cbind(summary_table_down, KNN_NuTo = summary_table_fun(CM))

5.4.15 CART - balanced - Nut

Hyperparameters

Best parameter CP is 0.00305

Accuracy Metrics

  • ACC 54.80

5.4.15.1 Fitting

#CART_grid <-  expand.grid(cp = c(0.001,0.005,0.01, 0.015, 0.02, 0.05, 0.1, 0.15, 0.2)) --> confirmed that highest kappa is at cp selected by caret
set.seed(12)
cart_Nut_fit <- train(rating_bin ~ .,
                  data = train_bin_Nut,
                  method = "rpart",
                  trControl = trCtrl_down,
                  tuneLength = 10)
cart_Nut_fit
#> CART 
#> 
#> 7623 samples
#>    4 predictor
#>    2 classes: 'good', 'bad' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold) 
#> Summary of sample sizes: 6099, 6099, 6098, 6098, 6098 
#> Addtional sampling using down-sampling
#> 
#> Resampling results across tuning parameters:
#> 
#>   cp        Accuracy  Kappa  
#>   0.001284  0.5270    0.05414
#>   0.001445  0.5289    0.05842
#>   0.001605  0.5321    0.06505
#>   0.001734  0.5350    0.06739
#>   0.001819  0.5330    0.06865
#>   0.001926  0.5351    0.06992
#>   0.002033  0.5407    0.06416
#>   0.002568  0.5428    0.07914
#>   0.002729  0.5432    0.07873
#>   0.003050  0.5477    0.08129
#> 
#> Accuracy was used to select the optimal model using the
#>  largest value.
#> The final value used for the model was cp = 0.00305.

It might seem that the CP chosen by caret is not optimal, since the accuracy curve continues to rise; however, using a manual tuning grid, we confirmed that this CP is indeed the best once accuracy is combined with the Kappa metric.
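A sketch of that manual check, assuming the `cart_Nut_fit` object from the fitting step above and a plain sum as an illustrative way of combining the two metrics:

```r
# Sketch: ranking the cp candidates by accuracy and Kappa together
# (assumes cart_Nut_fit from above; the combination rule shown here,
# a plain sum, is only illustrative)
cart_Nut_fit$results %>%
  mutate(combined = Accuracy + Kappa) %>%
  arrange(desc(combined)) %>%
  select(cp, Accuracy, Kappa, combined) %>%
  head(3)
```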

# Get the results for each CP step
cart_Nut_results <- cart_Nut_fit$results

cart_Nut_results %>% 
  ggplot(aes(x=cp, y=Accuracy)) +
  geom_line()+
  labs(x = "CP Parameter", y = "Accuracy", title = "CP Plot")

5.4.15.2 Predictions

#Now using the cv to predict test set
cart_Nut_pred <- predict(cart_Nut_fit, newdata = test_bin_Nut)
#And looking at the confusion matrix for the predictions using the cv 
CM <- confusionMatrix(reference = test_bin_Nut$rating_bin, data = cart_Nut_pred)
CM
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction good bad
#>       good  914 561
#>       bad   588 477
#>                                         
#>                Accuracy : 0.548         
#>                  95% CI : (0.528, 0.567)
#>     No Information Rate : 0.591         
#>     P-Value [Acc > NIR] : 1.000         
#>                                         
#>                   Kappa : 0.068         
#>                                         
#>  Mcnemar's Test P-Value : 0.443         
#>                                         
#>             Sensitivity : 0.609         
#>             Specificity : 0.460         
#>          Pos Pred Value : 0.620         
#>          Neg Pred Value : 0.448         
#>              Prevalence : 0.591         
#>          Detection Rate : 0.360         
#>    Detection Prevalence : 0.581         
#>       Balanced Accuracy : 0.534         
#>                                         
#>        'Positive' Class : good          
#> 
summary_table_down <- cbind(summary_table_down, CART_Nut = summary_table_fun(CM))

5.4.16 CART - balanced - NuBi

Hyperparameters

Best parameter CP is 0.00546

Accuracy Metrics

  • ACC 53.7

5.4.16.1 Fitting

#CART_grid <-  expand.grid(cp = c(0.001,0.005,0.01, 0.015, 0.02, 0.05, 0.1, 0.15, 0.2)) --> confirmed that highest kappa is at cp selected by caret
set.seed(12)
cart_NuBi_fit <- train(rating_bin ~ .,
                  data = train_bin_NuBi,
                  method = "rpart",
                  trControl = trCtrl_down,
                  tuneLength = 10)
cart_NuBi_fit
#> CART 
#> 
#> 7623 samples
#>   18 predictor
#>    2 classes: 'good', 'bad' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold) 
#> Summary of sample sizes: 6099, 6099, 6098, 6098, 6098 
#> Addtional sampling using down-sampling
#> 
#> Resampling results across tuning parameters:
#> 
#>   cp        Accuracy  Kappa  
#>   0.001284  0.5306    0.06280
#>   0.001364  0.5338    0.06578
#>   0.001445  0.5292    0.06085
#>   0.001498  0.5367    0.06777
#>   0.001846  0.5348    0.06562
#>   0.002354  0.5390    0.06488
#>   0.002889  0.5340    0.06357
#>   0.003210  0.5318    0.06404
#>   0.005457  0.5536    0.08693
#>   0.007223  0.5478    0.07620
#> 
#> Accuracy was used to select the optimal model using the
#>  largest value.
#> The final value used for the model was cp = 0.005457.

It might seem that the CP chosen by caret is not optimal, since the accuracy curve continues to rise; however, using a manual tuning grid, we confirmed that this CP is indeed the best once accuracy is combined with the Kappa metric.

# Get the results for each CP step
cart_NuBi_results <- cart_NuBi_fit$results

cart_NuBi_results %>% 
  ggplot(aes(x=cp, y=Accuracy)) +
  geom_line()+
  labs(x = "CP Parameter", y = "Accuracy", title = "CP Plot")

5.4.16.2 Predictions

#Now using the cv to predict test set
cart_NuBi_pred <- predict(cart_NuBi_fit, newdata = test_bin_NuBi)
#And looking at the confusion matrix for the predictions using the cv 
CM <- confusionMatrix(reference = test_bin_NuBi$rating_bin, data = cart_NuBi_pred)
CM
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction good bad
#>       good  818 493
#>       bad   684 545
#>                                         
#>                Accuracy : 0.537         
#>                  95% CI : (0.517, 0.556)
#>     No Information Rate : 0.591         
#>     P-Value [Acc > NIR] : 1             
#>                                         
#>                   Kappa : 0.068         
#>                                         
#>  Mcnemar's Test P-Value : 3.06e-08      
#>                                         
#>             Sensitivity : 0.545         
#>             Specificity : 0.525         
#>          Pos Pred Value : 0.624         
#>          Neg Pred Value : 0.443         
#>              Prevalence : 0.591         
#>          Detection Rate : 0.322         
#>    Detection Prevalence : 0.516         
#>       Balanced Accuracy : 0.535         
#>                                         
#>        'Positive' Class : good          
#> 
summary_table_down <- cbind(summary_table_down, CART_NuBi = summary_table_fun(CM))

5.4.17 CART - balanced - NuTo

Hyperparameters

Best parameter CP is 0.00257

Accuracy Metrics

  • ACC 53.7

5.4.17.1 Fitting

#CART_grid <-  expand.grid(cp = c(0.001,0.005,0.01, 0.015, 0.02, 0.05, 0.1, 0.15, 0.2)) --> confirmed that highest kappa is at cp selected by caret
set.seed(12)
cart_NuTo_fit <- train(rating_bin ~ .,
                  data = train_bin_NuTo,
                  method = "rpart",
                  trControl = trCtrl_down,
                  tuneLength = 10)
cart_NuTo_fit
#> CART 
#> 
#> 7623 samples
#>   18 predictor
#>    2 classes: 'good', 'bad' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold) 
#> Summary of sample sizes: 6099, 6099, 6098, 6098, 6098 
#> Addtional sampling using down-sampling
#> 
#> Resampling results across tuning parameters:
#> 
#>   cp        Accuracy  Kappa  
#>   0.001445  0.5158    0.03192
#>   0.001498  0.5218    0.03905
#>   0.001605  0.5237    0.04427
#>   0.001766  0.5233    0.04411
#>   0.001926  0.5263    0.04881
#>   0.002247  0.5338    0.06240
#>   0.002408  0.5352    0.06382
#>   0.002568  0.5352    0.06382
#>   0.003050  0.5344    0.06431
#>   0.004173  0.5350    0.06183
#> 
#> Accuracy was used to select the optimal model using the
#>  largest value.
#> The final value used for the model was cp = 0.002568.

It might seem that the CP chosen by caret is not optimal, since the accuracy curve continues to rise; however, using a manual tuning grid, we confirmed that this CP is indeed the best once accuracy is combined with the Kappa metric.

# Get the results for each CP step
cart_NuTo_results <- cart_NuTo_fit$results

cart_NuTo_results %>% 
  ggplot(aes(x=cp, y=Accuracy)) +
  geom_line()+
  labs(x = "CP Parameter", y = "Accuracy", title = "CP Plot")

5.4.17.2 Predictions

#Now using the cv to predict test set
cart_NuTo_pred <- predict(cart_NuTo_fit, newdata = test_bin_NuTo)
#And looking at the confusion matrix for the predictions using the cv 
CM <- confusionMatrix(reference = test_bin_NuTo$rating_bin, data = cart_NuTo_pred)
CM
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction good bad
#>       good  785 459
#>       bad   717 579
#>                                         
#>                Accuracy : 0.537         
#>                  95% CI : (0.517, 0.557)
#>     No Information Rate : 0.591         
#>     P-Value [Acc > NIR] : 1             
#>                                         
#>                   Kappa : 0.077         
#>                                         
#>  Mcnemar's Test P-Value : 6.67e-14      
#>                                         
#>             Sensitivity : 0.523         
#>             Specificity : 0.558         
#>          Pos Pred Value : 0.631         
#>          Neg Pred Value : 0.447         
#>              Prevalence : 0.591         
#>          Detection Rate : 0.309         
#>    Detection Prevalence : 0.490         
#>       Balanced Accuracy : 0.540         
#>                                         
#>        'Positive' Class : good          
#> 
summary_table_down <- cbind(summary_table_down, CART_NuTo = summary_table_fun(CM))

5.5 All variables (Nut + 293 ingredients)

5.5.1 Creating dataset

analysis_all <- recipes %>% 
  select(rating, all_of(c(nutritional_values, all_ingredients))) %>% 
  mutate(rating_bin = as.factor(ifelse(rating<4, "bad", "good")), across(all_of(all_ingredients), as.factor)) %>% 
  select(-rating) %>% 
  select(rating_bin, everything())

#reversing rating_bin factor order
analysis_all$rating_bin <- factor(analysis_all$rating_bin, levels=rev(levels(analysis_all$rating_bin)))

#normalising newly created df
analysis_all <- analysis_all %>% 
  mutate(across(where(is.numeric), my_normalise))
#we noticed that 5 ingredients appear in none of the recipes
ingredients_to_remove <- analysis_all %>%
  select_if(~nlevels(.) == 1) %>% colnames()

#removing those 5 ingredients
analysis_all <- analysis_all %>% 
  select(-all_of(ingredients_to_remove))
#creating training and test sets
set.seed(12)
index_all <- createDataPartition(analysis_all$rating_bin, p=0.75, list = FALSE)

We create training and test sets using the binary rating_bin variable (only 2 levels), this time including the nutritional values and all ingredient dummies instead of the engineered variables used in the sub-section above.

#creating training and test set with multilevel rating variable
train_all <- analysis_all[index_all, ]
test_all <- analysis_all[-index_all, ]

table(train_all$rating_bin)
#> 
#> good  bad 
#> 4508 3115

5.5.1.1 Balancing the binary training set through manual downsampling

#filtering by rating class
set.seed(12)
tr_good <- train_all %>%  filter(rating_bin == "good")
tr_bad <- train_all %>%  filter(rating_bin == "bad")

#indexing "bad" and creating new resampled training set
index_good <- sample(x = 1:nrow(tr_good), size = nrow(tr_bad), replace = FALSE)
downsamp_tr_all <- tibble(rbind(tr_good[index_good,], tr_bad))

#checking that we have the correct number of good and bad
table(downsamp_tr_all$rating_bin)
#> 
#> good  bad 
#> 3115 3115
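caret also ships a `downSample()` helper that performs the same class balancing as the manual indexing above in one call; a sketch, assuming the `train_all` object from the partitioning step:

```r
# Sketch: caret::downSample() balances the classes by randomly dropping
# majority-class rows, equivalent to the manual indexing above
set.seed(12)
downsamp_alt <- caret::downSample(x = train_all %>% select(-rating_bin),
                                  y = train_all$rating_bin,
                                  yname = "rating_bin")
table(downsamp_alt$rating_bin)
```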
#manual upsampling

# #filtering by rating class
# set.seed(12)
# tr_good <- train_bin %>%  filter(rating_bin == "good")
# tr_bad <- train_bin %>%  filter(rating_bin == "bad")
# 
# #indexing "bad" and creating new resampled training set
# index_bad <- sample(x = 1:nrow(tr_bad), size = nrow(tr_good), replace = TRUE)
# upsamp_tr_bin <- tibble(rbind(tr_good, tr_bad[index_bad,]))
# 
# #checking that we have the correct number of good and bad
# table(upsamp_tr_bin$rating_bin)

5.5.2 Random Forest - balanced - All Var

Hyperparameters

Best tune: mtry = 147

Accuracy Metrics

  • ACC 56.0
  • BACC 56.1
  • Kappa 11.8
  • Recall 55.5

5.5.2.1 Fitting

set.seed(12)
rf_all_fit <- train(rating_bin ~ .,
                    data = train_all,
                    method = "rf",
                    trControl = trCtrl_down,
                    tuneLength = 3)
#show the random forest with cv 
rf_all_fit
#> Random Forest 
#> 
#> 7623 samples
#>  292 predictor
#>    2 classes: 'good', 'bad' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold) 
#> Summary of sample sizes: 6099, 6099, 6098, 6098, 6098 
#> Addtional sampling using down-sampling
#> 
#> Resampling results across tuning parameters:
#> 
#>   mtry  Accuracy  Kappa  
#>     2   0.5289    0.09973
#>   147   0.5611    0.12227
#>   292   0.5578    0.11194
#> 
#> Accuracy was used to select the optimal model using the
#>  largest value.
#> The final value used for the model was mtry = 147.
plot(rf_all_fit, main = "Plot of CV Accuracy relative to Mtry parameter")

5.5.2.2 Predictions

set.seed(12)
#make predictions
rf_all_pred <- predict(rf_all_fit, newdata = test_all)

#confusion matrix 
CM <- confusionMatrix(data = rf_all_pred, reference = test_all$rating_bin)
CM
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction good bad
#>       good  834 450
#>       bad   668 588
#>                                        
#>                Accuracy : 0.56         
#>                  95% CI : (0.54, 0.579)
#>     No Information Rate : 0.591        
#>     P-Value [Acc > NIR] : 0.999        
#>                                        
#>                   Kappa : 0.118        
#>                                        
#>  Mcnemar's Test P-Value : 8.59e-11     
#>                                        
#>             Sensitivity : 0.555        
#>             Specificity : 0.566        
#>          Pos Pred Value : 0.650        
#>          Neg Pred Value : 0.468        
#>              Prevalence : 0.591        
#>          Detection Rate : 0.328        
#>    Detection Prevalence : 0.506        
#>       Balanced Accuracy : 0.561        
#>                                        
#>        'Positive' Class : good         
#> 
#adding results to summary table
summary_table_down <- cbind(summary_table_down, RF_all = summary_table_fun(CM))

5.5.3 Logistic Regression

#note: fitting this model on all variables takes a long time
logreg_all_fit <- train(rating_bin~., 
                       data=downsamp_tr_all, 
                       method="glm", 
                       family = "binomial",
                       trControl=trCtrl_down,
                       trace = TRUE)
#> Deviance = 6411 Iterations - 1
#> Deviance = 6401 Iterations - 2
#> Deviance = 6400 Iterations - 3
#> Deviance = 6399 Iterations - 4
#> Deviance = 6399 Iterations - 5
#> Deviance = 6399 Iterations - 6
#> Deviance = 6399 Iterations - 7
#> Deviance = 6399 Iterations - 8
#> Deviance = 6399 Iterations - 9
#> Deviance = 6399 Iterations - 10
#> Deviance = 6399 Iterations - 11
#> Deviance = 6399 Iterations - 12
#> Deviance = 6399 Iterations - 13
#> Deviance = 6399 Iterations - 14
#> Deviance = 6440 Iterations - 1
#> Deviance = 6430 Iterations - 2
#> Deviance = 6428 Iterations - 3
#> Deviance = 6427 Iterations - 4
#> Deviance = 6426 Iterations - 5
#> Deviance = 6426 Iterations - 6
#> Deviance = 6426 Iterations - 7
#> Deviance = 6426 Iterations - 8
#> Deviance = 6426 Iterations - 9
#> Deviance = 6426 Iterations - 10
#> Deviance = 6426 Iterations - 11
#> Deviance = 6426 Iterations - 12
#> Deviance = 6426 Iterations - 13
#> Deviance = 6426 Iterations - 14
#> Deviance = 6373 Iterations - 1
#> Deviance = 6364 Iterations - 2
#> Deviance = 6362 Iterations - 3
#> Deviance = 6361 Iterations - 4
#> Deviance = 6361 Iterations - 5
#> Deviance = 6361 Iterations - 6
#> Deviance = 6361 Iterations - 7
#> Deviance = 6361 Iterations - 8
#> Deviance = 6361 Iterations - 9
#> Deviance = 6361 Iterations - 10
#> Deviance = 6361 Iterations - 11
#> Deviance = 6361 Iterations - 12
#> Deviance = 6361 Iterations - 13
#> Deviance = 6361 Iterations - 14
#> Deviance = 6433 Iterations - 1
#> Deviance = 6424 Iterations - 2
#> Deviance = 6422 Iterations - 3
#> Deviance = 6422 Iterations - 4
#> Deviance = 6422 Iterations - 5
#> Deviance = 6422 Iterations - 6
#> Deviance = 6422 Iterations - 7
#> Deviance = 6422 Iterations - 8
#> Deviance = 6422 Iterations - 9
#> Deviance = 6422 Iterations - 10
#> Deviance = 6422 Iterations - 11
#> Deviance = 6422 Iterations - 12
#> Deviance = 6422 Iterations - 13
#> Deviance = 6422 Iterations - 1
#> Deviance = 6413 Iterations - 2
#> Deviance = 6412 Iterations - 3
#> Deviance = 6411 Iterations - 4
#> Deviance = 6411 Iterations - 5
#> Deviance = 6411 Iterations - 6
#> Deviance = 6411 Iterations - 7
#> Deviance = 6411 Iterations - 8
#> Deviance = 6411 Iterations - 9
#> Deviance = 6411 Iterations - 10
#> Deviance = 6411 Iterations - 11
#> Deviance = 6411 Iterations - 12
#> Deviance = 6411 Iterations - 13
#> Deviance = 8092 Iterations - 1
#> Deviance = 8082 Iterations - 2
#> Deviance = 8081 Iterations - 3
#> Deviance = 8080 Iterations - 4
#> Deviance = 8080 Iterations - 5
#> Deviance = 8080 Iterations - 6
#> Deviance = 8080 Iterations - 7
#> Deviance = 8080 Iterations - 8
#> Deviance = 8080 Iterations - 9
#> Deviance = 8080 Iterations - 10
#> Deviance = 8080 Iterations - 11
#> Deviance = 8080 Iterations - 12
#> Deviance = 8080 Iterations - 13
logreg_all_fit
#> Generalized Linear Model 
#> 
#> 6230 samples
#>  292 predictor
#>    2 classes: 'good', 'bad' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold) 
#> Summary of sample sizes: 4984, 4984, 4984, 4984, 4984 
#> Addtional sampling using down-sampling
#> 
#> Resampling results:
#> 
#>   Accuracy  Kappa 
#>   0.5568    0.1136

5.5.4 Predictions

#predicting
logreg_all_pred <- predict(logreg_all_fit, newdata = test_all)
#creating confusion matrix
CM <- confusionMatrix(reference = test_all$rating_bin, data = logreg_all_pred)
CM
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction good bad
#>       good  832 454
#>       bad   670 584
#>                                         
#>                Accuracy : 0.557         
#>                  95% CI : (0.538, 0.577)
#>     No Information Rate : 0.591         
#>     P-Value [Acc > NIR] : 1             
#>                                         
#>                   Kappa : 0.113         
#>                                         
#>  Mcnemar's Test P-Value : 1.43e-10      
#>                                         
#>             Sensitivity : 0.554         
#>             Specificity : 0.563         
#>          Pos Pred Value : 0.647         
#>          Neg Pred Value : 0.466         
#>              Prevalence : 0.591         
#>          Detection Rate : 0.328         
#>    Detection Prevalence : 0.506         
#>       Balanced Accuracy : 0.558         
#>                                         
#>        'Positive' Class : good          
#> 
#adding results to the summary_table_unbal
summary_table_all <- cbind(summary_table_unbal, LogReg_all = summary_table_fun(CM))

5.5.5 CART

Hyperparameters

Best parameter CP is 0.00361

Accuracy Metrics

  • ACC 53.4

5.5.5.1 Fitting

#CART_grid <-  expand.grid(cp = c(0.001,0.005,0.01, 0.015, 0.02, 0.05, 0.1, 0.15, 0.2)) --> confirmed that highest kappa is at cp selected by caret
set.seed(12)
cart_all_fit <- train(rating_bin ~ .,
                  data = train_all,
                  method = "rpart",
                  trControl = trCtrl_down,
                  tuneLength = 10)
cart_all_fit
#> CART 
#> 
#> 7623 samples
#>  292 predictor
#>    2 classes: 'good', 'bad' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold) 
#> Summary of sample sizes: 6099, 6099, 6098, 6098, 6098 
#> Addtional sampling using down-sampling
#> 
#> Resampling results across tuning parameters:
#> 
#>   cp         Accuracy  Kappa  
#>   0.0009172  0.5426    0.06545
#>   0.0009631  0.5410    0.06406
#>   0.0010166  0.5414    0.06589
#>   0.0010701  0.5397    0.06309
#>   0.0012841  0.5468    0.07655
#>   0.0014446  0.5393    0.07740
#>   0.0016051  0.5407    0.07690
#>   0.0018192  0.5352    0.07239
#>   0.0026485  0.5430    0.07834
#>   0.0036116  0.5475    0.08248
#> 
#> Accuracy was used to select the optimal model using the
#>  largest value.
#> The final value used for the model was cp = 0.003612.
# Get the results for each CP step
cart_all_results <- cart_all_fit$results

cart_all_results %>% 
  ggplot(aes(x=cp, y=Accuracy)) +
  geom_line()+
  labs(x = "CP Parameter", y = "Accuracy", title = "CP Plot")

5.5.5.2 Predictions

#Now using the cv to predict test set
cart_all_pred <- predict(cart_all_fit, newdata = test_all)
#And looking at the confusion matrix for the predictions using the cv 
CM <- confusionMatrix(reference = test_all$rating_bin, data = cart_all_pred)
CM
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction good bad
#>       good  792 474
#>       bad   710 564
#>                                         
#>                Accuracy : 0.534         
#>                  95% CI : (0.514, 0.553)
#>     No Information Rate : 0.591         
#>     P-Value [Acc > NIR] : 1             
#>                                         
#>                   Kappa : 0.068         
#>                                         
#>  Mcnemar's Test P-Value : 8.52e-12      
#>                                         
#>             Sensitivity : 0.527         
#>             Specificity : 0.543         
#>          Pos Pred Value : 0.626         
#>          Neg Pred Value : 0.443         
#>              Prevalence : 0.591         
#>          Detection Rate : 0.312         
#>    Detection Prevalence : 0.498         
#>       Balanced Accuracy : 0.535         
#>                                         
#>        'Positive' Class : good          
#> 
summary_table_all <- cbind(summary_table_all, CART_all = summary_table_fun(CM))

5.5.6 XGBoost - Unbalanced

numeric_tr <- train_all %>% 
  mutate(rating_bin = ifelse(rating_bin == "good", 1, 0), across(everything(), as.character)) %>% 
  mutate(across(everything(), as.numeric))

numeric_downsamp_tr <- downsamp_tr_all %>% 
  mutate(rating_bin = ifelse(rating_bin == "good", 1, 0), across(everything(), as.character)) %>% 
  mutate(across(everything(), as.numeric))

numeric_te <- test_all %>% 
  mutate(rating_bin = ifelse(rating_bin == "good", 1, 0), across(everything(), as.character)) %>% 
  mutate(across(everything(), as.numeric))

#creating datasets which don't include the labels, for inputs in the xgb Matrix
explan_tr <- numeric_tr %>% 
  select(-rating_bin)

explan_downsamp_tr <- numeric_downsamp_tr %>% 
  select(-rating_bin)

explan_te <- numeric_te %>% 
  select(-rating_bin)

#prepare data for XGBoost by creating xgb matrix
xgb_tr <- xgb.DMatrix(data = as.matrix(explan_tr), label = numeric_tr$rating_bin)
xgb_downsamp_tr <- xgb.DMatrix(data = as.matrix(explan_downsamp_tr), label = numeric_downsamp_tr$rating_bin)
xgb_te <- xgb.DMatrix(data = as.matrix(explan_te), label = numeric_te$rating_bin)

Accuracy metrics

  • ACC 59.4
  • BACC 55.7
  • Kappa 11.9
  • Recall 76.4

5.5.6.1 Fitting

#set hyperparameters
param_list <- list(
  objective = "binary:logistic",
  eta = 0.1,
  gamma = 0,
  max_depth = 5,
  min_child_weight = 1,
  subsample = 0.6,
  colsample_bytree = 0.6
)

#train the model
set.seed(12)
xgb_unbal_fit <- xgboost(data = xgb_tr,
                     params = param_list,
                     nrounds = 500,
                     verbose = FALSE)
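Instead of fixing nrounds at 500, the number of boosting rounds could be chosen by cross-validation with `xgb.cv()`; a sketch, assuming the `xgb_tr` matrix and `param_list` from above:

```r
# Sketch: xgb.cv() picks the number of rounds via 5-fold CV with early
# stopping (assumes xgb_tr and param_list from above)
set.seed(12)
xgb_cv <- xgb.cv(params = param_list, data = xgb_tr, nrounds = 500,
                 nfold = 5, early_stopping_rounds = 20, verbose = FALSE)
xgb_cv$best_iteration
```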

5.5.6.2 Predictions

xgb_unbal_prob <- predict(xgb_unbal_fit, newdata = xgb_te)

xgb_unbal_pred <- as.factor(ifelse(xgb_unbal_prob<0.5, 0, 1))

confusionMatrix(reference = as.factor(numeric_te$rating_bin), data = xgb_unbal_pred, positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction    0    1
#>          0  362  354
#>          1  676 1148
#>                                         
#>                Accuracy : 0.594         
#>                  95% CI : (0.575, 0.614)
#>     No Information Rate : 0.591         
#>     P-Value [Acc > NIR] : 0.381         
#>                                         
#>                   Kappa : 0.119         
#>                                         
#>  Mcnemar's Test P-Value : <2e-16        
#>                                         
#>             Sensitivity : 0.764         
#>             Specificity : 0.349         
#>          Pos Pred Value : 0.629         
#>          Neg Pred Value : 0.506         
#>              Prevalence : 0.591         
#>          Detection Rate : 0.452         
#>    Detection Prevalence : 0.718         
#>       Balanced Accuracy : 0.557         
#>                                         
#>        'Positive' Class : 1             
#> 

Compared with the earlier unbalanced models, XGBoost only marginally improves accuracy, but it greatly improves balanced accuracy and Kappa.

rocit_emp <- rocit(score = xgb_unbal_prob, class = numeric_te$rating_bin, method = "emp")
plot(rocit_emp, col = c(1,"gray50"),legend = TRUE, YIndex = TRUE)
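Instead of reading the optimal cutoff off the plot, the Youden index can be computed directly from the `rocit` object; a sketch assuming ROCit's `Cutoff`, `TPR`, and `FPR` fields:

```r
# Youden index J = TPR - FPR; the best threshold is the cutoff that maximises J
youden <- rocit_emp$TPR - rocit_emp$FPR
best_cutoff <- rocit_emp$Cutoff[which.max(youden)]
best_cutoff
```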

#optimal cutoff (Youden index) from the ROC curve is 0.5501

xgb_unbal_pred_threshold <- as.factor(ifelse(xgb_unbal_prob<0.5501, 0, 1))

confusionMatrix(reference = as.factor(numeric_te$rating_bin), data = xgb_unbal_pred_threshold, positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction    0    1
#>          0  462  475
#>          1  576 1027
#>                                         
#>                Accuracy : 0.586         
#>                  95% CI : (0.567, 0.605)
#>     No Information Rate : 0.591         
#>     P-Value [Acc > NIR] : 0.70738       
#>                                         
#>                   Kappa : 0.131         
#>                                         
#>  Mcnemar's Test P-Value : 0.00204       
#>                                         
#>             Sensitivity : 0.684         
#>             Specificity : 0.445         
#>          Pos Pred Value : 0.641         
#>          Neg Pred Value : 0.493         
#>              Prevalence : 0.591         
#>          Detection Rate : 0.404         
#>    Detection Prevalence : 0.631         
#>       Balanced Accuracy : 0.564         
#>                                         
#>        'Positive' Class : 1             
#> 

5.5.7 XGBoost - Balanced

Accuracy metrics

  • ACC 56.5
  • BACC 56.4
  • Kappa 12.5
  • Recall 57.1

5.5.7.1 Fitting

#set hyperparameters
param_list <- list(
  objective = "binary:logistic",
  eta = 0.1,
  gamma = 0,
  max_depth = 5,
  min_child_weight = 1,
  subsample = 0.6,
  colsample_bytree = 0.6
)

#train the model
set.seed(12)
xgb_down_fit <- xgboost(data = xgb_downsamp_tr,
                     params = param_list,
                     nrounds = 500,
                     verbose = FALSE)

5.5.7.2 Predictions

xgb_down_prob <- predict(xgb_down_fit, newdata = xgb_te)

xgb_down_pred <- as.factor(ifelse(xgb_down_prob<0.5, 0, 1))

CM <- confusionMatrix(reference = as.factor(numeric_te$rating_bin), data = xgb_down_pred, positive = "1")
CM
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 579 645
#>          1 459 857
#>                                         
#>                Accuracy : 0.565         
#>                  95% CI : (0.546, 0.585)
#>     No Information Rate : 0.591         
#>     P-Value [Acc > NIR] : 0.996         
#>                                         
#>                   Kappa : 0.125         
#>                                         
#>  Mcnemar's Test P-Value : 2.58e-08      
#>                                         
#>             Sensitivity : 0.571         
#>             Specificity : 0.558         
#>          Pos Pred Value : 0.651         
#>          Neg Pred Value : 0.473         
#>              Prevalence : 0.591         
#>          Detection Rate : 0.337         
#>    Detection Prevalence : 0.518         
#>       Balanced Accuracy : 0.564         
#>                                         
#>        'Positive' Class : 1             
#> 
#adding results to summary table
summary_table_down <- cbind(summary_table_down, XGB_all = summary_table_fun(CM))

5.6 Model Selection

Here we can see the results for section 5.3 on the unbalanced data.

summary_table_unbal %>% 
  kbl() %>% 
  kable_styling(position = "center")
metric LogReg_Nut LogReg_NuBi LogReg_NuTo KNN_Nut KNN_NuBi KNN_NuTo CART_Nut CART_NuBi CART_NuTo
accuracy 0.5913 0.5854 0.5894 0.5972 0.5839 0.5854 0.5913 0.5890 0.5913
balanced_accuracy 0.5000 0.5081 0.5126 0.5251 0.5130 0.5141 0.5000 0.5157 0.5000
sensitivity 1.0000 0.9314 0.9328 0.9201 0.9008 0.9048 1.0000 0.9168 1.0000
kappa 0.0000 0.0185 0.0289 0.0568 0.0294 0.0317 0.0000 0.0356 0.0000

Here we can see the results for section 5.4 on the downsampled data.

summary_table_down %>% 
  kbl() %>% 
  kable_styling(position = "center")
metric RF_Nut RF_NuBi RF_NuTo RF_Tot RF_Bin KNN_Nut KNN_NuBi KNN_NuTo CART_Nut CART_NuBi CART_NuTo RF_all XGB_all
accuracy 0.5370 0.5480 0.5626 0.5252 0.5319 0.5406 0.5165 0.5201 0.5476 0.5366 0.5370 0.5598 0.5654
balanced_accuracy 0.5371 0.5415 0.5564 0.5244 0.5262 0.5361 0.5174 0.5311 0.5340 0.5348 0.5402 0.5609 0.5642
sensitivity 0.5366 0.5772 0.5905 0.5286 0.5573 0.5606 0.5126 0.4707 0.6085 0.5446 0.5226 0.5553 0.5706
kappa 0.0719 0.0815 0.1107 0.0474 0.0514 0.0705 0.0336 0.0590 0.0678 0.0677 0.0775 0.1179 0.1249
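The winner can also be picked programmatically from the summary table; a sketch assuming the `metric`-column layout shown above:

```r
# model column with the highest balanced accuracy
ba <- summary_table_down[summary_table_down$metric == "balanced_accuracy", -1]
names(ba)[which.max(unlist(ba))]  # "XGB_all" on the values above
```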

5.7 Variable Importance

#XGBoost variable importance
xgb_importance_matrix <- xgb.importance(model = xgb_down_fit)
xgb_importance_matrix
#>            Feature      Gain     Cover Frequency
#>   1:        sodium 1.709e-01 0.1142332 0.1829602
#>   2:      calories 1.658e-01 0.1134014 0.1741294
#>   3:       protein 1.159e-01 0.0619127 0.1327114
#>   4:           fat 1.100e-01 0.0595420 0.1228856
#>   5:        tomato 1.141e-02 0.0057121 0.0123134
#>  ---                                            
#> 231: yellow_squash 8.519e-05 0.0003794 0.0001244
#> 232:         quail 7.133e-05 0.0003740 0.0001244
#> 233: passion_fruit 7.080e-05 0.0003819 0.0001244
#> 234: broccoli_rabe 7.034e-05 0.0003426 0.0001244
#> 235:        oyster 6.788e-05 0.0003648 0.0001244
xgb.ggplot.importance(importance_matrix = xgb_importance_matrix) +
  labs(title = "XGBoost Feature Importance")
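The ranking could also be used to prune the feature set before refitting; a hypothetical sketch, relying on `xgb.importance` returning features sorted by Gain:

```r
# keep the 20 features with the highest Gain and rebuild a slimmer training matrix
top_feats <- head(xgb_importance_matrix$Feature, 20)
xgb_downsamp_tr_small <- xgb.DMatrix(
  data = as.matrix(explan_downsamp_tr[, top_feats]),
  label = numeric_downsamp_tr$rating_bin
)
```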

6 Unsupervised Learning

6.1 Clustering


Questions

  • Should we convert the binary columns to factor, or can we leave them as integer for modelling?
  • Should we balance the data, both in the binary rating_bin case and for the regular rating with 7 classes?
  • Should we normalise the numerical data?
  • The slides mention that the validation set should not be balanced, but how do we do that using train() with caret?
  • Should we really use balanced data for training? At least for KNN it always makes K = 1 the best choice, whereas K was larger when we trained on unbalanced data. Apparently balancing the data is not required for KNN.